Thread: [PATCH] Initial progress reporting for COPY command

[PATCH] Initial progress reporting for COPY command

From

Josef Šimánek

Date:

14 June 2020, 12:32:33

Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL maillist (https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer), I have prepared an initial patch for COPY command progress reporting.

Few examples first:

"COPY (SELECT * FROM test) TO '/tmp/ids';"

yr=# SELECT * from pg_stat_progress_copy;
pid | datid | datname | relid | direction | file | program | lines_processed | file_bytes_processed
---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------
3347126 | 16384 | yr | 0 | TO | t | f | 3529943 | 24906226
(1 row)

"COPY test FROM '/tmp/ids';

yr=# SELECT * from pg_stat_progress_copy;
pid | datid | datname | relid | direction | file | program | lines_processed | file_bytes_processed
---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------
3347126 | 16384 | yr | 16385 | FROM | t | f | 121591999 | 957218816
(1 row)

Columns are inspired by CREATE INDEX progress report system view.

pid - integer - PID of backend
datid - oid - OID of related database
datname - name - name of related database (this seems redundant, since oid should be enough, but it is the same in CREATE INDEX)
relid - oid - oid of table related to COPY command, when not known (for example when copying to file, it is 0)
direction - text - one of "FROM" or "TO" depends on command used
file - bool - is file is used?
program - bool - is program used?
lines_processed - bigint - amount of processed lines, works for both directions (FROM/TO)
file_bytes_processed - amount of bytes processed when file is used (otherwise 0), works for both direction (
FROM/TO) when file is used (file = t)

Patch is attached and can be found also at https://github.com/simi/postgres/pull/5.

Diff version: https://github.com/simi/postgres/pull/5.diff
Patch version: https://github.com/simi/postgres/pull/5.patch

I havefew initial notes and questions.

I'm using ftell to get current position in file to populate file_bytes_processed without error handling (ftell can return -1L and also populate errno on problems).

1. Is that a good way to get progress of file processing?
2. Is it safe in given context to not care about errors? If not, what to do on error?

Some columns are not populated on certain COPY commands. For example when a file is not used, file_bytes_processed is set to 0. Would it be better to use NULL instead when the column is not related to the current command? Same problem is for relid column.

I have not found any tests for progress reporting. Are there any? It would need two backends running (one running COPY, one checking output of report view). Is there any similar test I can inspire at? In theory, it should be possible to use dblink_send_query to run async COPY command in the background.

My initial (attached) patch also doesn't introduce documentation for this system view. I can add that later once this patch is finalized (if that happens).

Attachment

copy-progress-v1.patch

Re: [PATCH] Initial progress reporting for COPY command

From

Michael Paquier

Date:

15 June 2020, 00:18:28

Hi Josef,

On Sun, Jun 14, 2020 at 02:32:33PM +0200, Josef Šimánek wrote:
> Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL
> maillist (
>
https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer),
> I have prepared an initial patch for COPY command progress reporting.

Sounds like a good idea to me.

> I have not found any tests for progress reporting. Are there any? It would
> need two backends running (one running COPY, one checking output of report
> view). Is there any similar test I can inspire at? In theory, it should be
> possible to use dblink_send_query to run async COPY command in the
> background.

We don't have any tests in core.  I think that making deterministic
test cases is rather tricky here as long as we don't have a more
advanced testing framework that allows is to lock certain code paths
and keep around an expected state until a second session comes around
and looks at the progress catalog (even that would need adding more
code to core to mark the extra point looked at).  So I think that it is
fine to not focus on that for this feature.  The important parts are
the choice of the progress points and the data sent to MyProc, and
both should be chosen wisely.

> My initial (attached) patch also doesn't introduce documentation for this
> system view. I can add that later once this patch is finalized (if that
> happens).

You may want to add it to the next commit fest:
https://commitfest.postgresql.org/28/
Documentation is necessary, and having some would ease reviews.
--
Michael

Attachment

signature.asc

Re: [PATCH] Initial progress reporting for COPY command

From

Fujii Masao

Date:

15 June 2020, 04:38:57

On 2020/06/14 21:32, Josef Šimánek wrote:
> Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL maillist
(https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer),
Ihave prepared an initial patch for COPY command progress reporting.
 

Sounds nice!


> file - bool - is file is used?
> program - bool - is program used?

Are these fields really necessary in a progress view?
What values are reported when STDOUT/STDIN is specified in COPY command?


> file_bytes_processed - amount of bytes processed when file is used (otherwise 0), works for both direction (
> FROM/TO) when file is used (file = t)

What value is reported when STDOUT/STDIN is specified in COPY command?

Regards,


-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [PATCH] Initial progress reporting for COPY command

From

Bharath Rupireddy

Date:

15 June 2020, 05:34:06

> I'm using ftell to get current position in file to populate file_bytes_processed without error handling (ftell can
return-1L and also populate errno on problems).
 
>
> 1. Is that a good way to get progress of file processing?

IMO, it's better to handle the error cases. One possible case where
ftell can return -1 and set errno is when the total bytes processed is
more than LONG_MAX.

Will your patch handle file_bytes_processed reporting for COPY FROM
STDIN cases? For this case, ftell can't be used.

Instead of using ftell and worrying about the errors, a simple
approach could be to have a uint64 variable in CopyStateData to track
the number of bytes read whenever CopyGetData is called. This approach
can also handle the case of COPY FROM STDIN.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Initial progress reporting for COPY command

From

Josef Šimánek

Date:

21 June 2020, 11:31:16

po 15. 6. 2020 v 2:18 odesílatel Michael Paquier <michael@paquier.xyz> napsal:

Hi Josef,

On Sun, Jun 14, 2020 at 02:32:33PM +0200, Josef Šimánek wrote:
> Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL
> maillist (
> https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer),
> I have prepared an initial patch for COPY command progress reporting.

Sounds like a good idea to me.

Great. I will continue working on this.

> I have not found any tests for progress reporting. Are there any? It would
> need two backends running (one running COPY, one checking output of report
> view). Is there any similar test I can inspire at? In theory, it should be
> possible to use dblink_send_query to run async COPY command in the
> background.

We don't have any tests in core. I think that making deterministic
test cases is rather tricky here as long as we don't have a more
advanced testing framework that allows is to lock certain code paths
and keep around an expected state until a second session comes around
and looks at the progress catalog (even that would need adding more
code to core to mark the extra point looked at). So I think that it is
fine to not focus on that for this feature. The important parts are
the choice of the progress points and the data sent to MyProc, and
both should be chosen wisely.

Thanks for the info. I'm focusing exactly at looking for right spots to report the progress. I'll attach new patch with better places and supporting more options of reporting (including STDIN, STDOUT) soon and also I'll try to add it to commitfest.

> My initial (attached) patch also doesn't introduce documentation for this
> system view. I can add that later once this patch is finalized (if that
> happens).

You may want to add it to the next commit fest:
https://commitfest.postgresql.org/28/
Documentation is necessary, and having some would ease reviews.
--
Michael

Re: [PATCH] Initial progress reporting for COPY command

From

Josef Šimánek

Date:

21 June 2020, 11:33:01

po 15. 6. 2020 v 6:39 odesílatel Fujii Masao <masao.fujii@oss.nttdata.com> napsal:

On 2020/06/14 21:32, Josef Šimánek wrote:
> Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL maillist (https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer), I have prepared an initial patch for COPY command progress reporting.

Sounds nice!

> file - bool - is file is used?
> program - bool - is program used?

Are these fields really necessary in a progress view?
What values are reported when STDOUT/STDIN is specified in COPY command?

For STDOUT and STDIN file is true and program is false.

> file_bytes_processed - amount of bytes processed when file is used (otherwise 0), works for both direction (
> FROM/TO) when file is used (file = t)

What value is reported when STDOUT/STDIN is specified in COPY command?

For my first patch nothing was reported on STDOUT/STDIN usage. I'll attach new patch soon supporting those as well.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [PATCH] Initial progress reporting for COPY command

From

Josef Šimánek

Date:

21 June 2020, 11:33:46

po 15. 6. 2020 v 2:18 odesílatel Michael Paquier <michael@paquier.xyz> napsal:

Hi Josef,

On Sun, Jun 14, 2020 at 02:32:33PM +0200, Josef Šimánek wrote:
> Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL
> maillist (
> https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer),
> I have prepared an initial patch for COPY command progress reporting.

Sounds like a good idea to me.

> I have not found any tests for progress reporting. Are there any? It would
> need two backends running (one running COPY, one checking output of report
> view). Is there any similar test I can inspire at? In theory, it should be
> possible to use dblink_send_query to run async COPY command in the
> background.

We don't have any tests in core. I think that making deterministic
test cases is rather tricky here as long as we don't have a more
advanced testing framework that allows is to lock certain code paths
and keep around an expected state until a second session comes around
and looks at the progress catalog (even that would need adding more
code to core to mark the extra point looked at). So I think that it is
fine to not focus on that for this feature. The important parts are
the choice of the progress points and the data sent to MyProc, and
both should be chosen wisely.

> My initial (attached) patch also doesn't introduce documentation for this
> system view. I can add that later once this patch is finalized (if that
> happens).

You may want to add it to the next commit fest:
https://commitfest.postgresql.org/28/
Documentation is necessary, and having some would ease reviews.

I have added documentation, more code comments and I'll upload patch to commit fest.

--
Michael

Re: [PATCH] Initial progress reporting for COPY command

From

Josef Šimánek

Date:

21 June 2020, 11:34:29

po 15. 6. 2020 v 7:34 odesílatel Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> napsal:

> I'm using ftell to get current position in file to populate file_bytes_processed without error handling (ftell can return -1L and also populate errno on problems).
>
> 1. Is that a good way to get progress of file processing?

IMO, it's better to handle the error cases. One possible case where
ftell can return -1 and set errno is when the total bytes processed is
more than LONG_MAX.

Will your patch handle file_bytes_processed reporting for COPY FROM
STDIN cases? For this case, ftell can't be used.

Instead of using ftell and worrying about the errors, a simple
approach could be to have a uint64 variable in CopyStateData to track
the number of bytes read whenever CopyGetData is called. This approach
can also handle the case of COPY FROM STDIN.

Thanks for suggestion. I used this approach and latest patch supports both STDIN and STDOUT now.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Initial progress reporting for COPY command

From

Josef Šimánek

Date:

21 June 2020, 11:40:34

Thanks for all comments. I have updated code to support more options (including STDIN/STDOUT) and added some documentation.

Patch is attached and can be found also at https://github.com/simi/postgres/pull/5.

Diff version: https://github.com/simi/postgres/pull/5.diff
Patch version: https://github.com/simi/postgres/pull/5.patch

I'm also attaching screenshot of HTML documentation and html documentation file.

I'll do my best to get this to commitfest now.

ne 14. 6. 2020 v 14:32 odesílatel Josef Šimánek <josef.simanek@gmail.com> napsal:

Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL maillist (https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer), I have prepared an initial patch for COPY command progress reporting.

Few examples first:

"COPY (SELECT * FROM test) TO '/tmp/ids';"

yr=# SELECT * from pg_stat_progress_copy;
pid | datid | datname | relid | direction | file | program | lines_processed | file_bytes_processed
---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------
3347126 | 16384 | yr | 0 | TO | t | f | 3529943 | 24906226
(1 row)

"COPY test FROM '/tmp/ids';

yr=# SELECT * from pg_stat_progress_copy;
pid | datid | datname | relid | direction | file | program | lines_processed | file_bytes_processed
---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------
3347126 | 16384 | yr | 16385 | FROM | t | f | 121591999 | 957218816
(1 row)

Columns are inspired by CREATE INDEX progress report system view.

pid - integer - PID of backend
datid - oid - OID of related database
datname - name - name of related database (this seems redundant, since oid should be enough, but it is the same in CREATE INDEX)
relid - oid - oid of table related to COPY command, when not known (for example when copying to file, it is 0)
direction - text - one of "FROM" or "TO" depends on command used
file - bool - is file is used?
program - bool - is program used?
lines_processed - bigint - amount of processed lines, works for both directions (FROM/TO)
file_bytes_processed - amount of bytes processed when file is used (otherwise 0), works for both direction (
FROM/TO) when file is used (file = t)

Patch is attached and can be found also at https://github.com/simi/postgres/pull/5.

Diff version: https://github.com/simi/postgres/pull/5.diff
Patch version: https://github.com/simi/postgres/pull/5.patch

I havefew initial notes and questions.

I'm using ftell to get current position in file to populate file_bytes_processed without error handling (ftell can return -1L and also populate errno on problems).

1. Is that a good way to get progress of file processing?
2. Is it safe in given context to not care about errors? If not, what to do on error?

Some columns are not populated on certain COPY commands. For example when a file is not used, file_bytes_processed is set to 0. Would it be better to use NULL instead when the column is not related to the current command? Same problem is for relid column.

I have not found any tests for progress reporting. Are there any? It would need two backends running (one running COPY, one checking output of report view). Is there any similar test I can inspire at? In theory, it should be possible to use dblink_send_query to run async COPY command in the background.

My initial (attached) patch also doesn't introduce documentation for this system view. I can add that later once this patch is finalized (if that happens).

Attachment

Re: [PATCH] Initial progress reporting for COPY command

From

Fujii Masao

Date:

22 June 2020, 02:48:40

On 2020/06/21 20:33, Josef Šimánek wrote:
> 
> 
> po 15. 6. 2020 v 6:39 odesílatel Fujii Masao <masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>>
napsal:
> 
> 
> 
>     On 2020/06/14 21:32, Josef Šimánek wrote:
>      > Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL maillist
(https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer),
Ihave prepared an initial patch for COPY command progress reporting.
 
> 
>     Sounds nice!
> 
> 
>      > file - bool - is file is used?
>      > program - bool - is program used?
> 
>     Are these fields really necessary in a progress view?
>     What values are reported when STDOUT/STDIN is specified in COPY command?
> 
> 
> For STDOUT and STDIN file is true and program is false.

Could you tell me why these columns are necessary in *progress* view?
If we want to see what copy command is actually running, we can see
pg_stat_activity, instead. For example,

     SELECT pc.*, a.query FROM pg_stat_progress_copy pc, pg_stat_activity a WHERE pc.pid = a.pid;

> 
>      > file_bytes_processed - amount of bytes processed when file is used (otherwise 0), works for both direction (
>      > FROM/TO) when file is used (file = t)
> 
>     What value is reported when STDOUT/STDIN is specified in COPY command?
> 
> 
> For my first patch nothing was reported on STDOUT/STDIN usage. I'll attach new patch soon supporting those as well.

Thanks for the patch!

With the patch, pg_stat_progress_copy seems to report the progress of
the processing on file_fdw. Is this intentional?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [PATCH] Initial progress reporting for COPY command

From

vignesh C

Date:

22 June 2020, 07:15:19

On Sun, Jun 21, 2020 at 5:11 PM Josef Šimánek <josef.simanek@gmail.com> wrote:
>
> Thanks for all comments. I have updated code to support more options (including STDIN/STDOUT) and added some
documentation.
>
> Patch is attached and can be found also at https://github.com/simi/postgres/pull/5.
>
> Diff version: https://github.com/simi/postgres/pull/5.diff
> Patch version: https://github.com/simi/postgres/pull/5.patch
>
> I'm also attaching screenshot of HTML documentation and html documentation file.
>
> I'll do my best to get this to commitfest now.
>
> ne 14. 6. 2020 v 14:32 odesílatel Josef Šimánek <josef.simanek@gmail.com> napsal:
>>
>> Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL maillist
(https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer),
Ihave prepared an initial patch for COPY command progress reporting. 
>>
>> Few examples first:
>>
>> "COPY (SELECT * FROM test) TO '/tmp/ids';"
>>
>> yr=# SELECT * from pg_stat_progress_copy;
>>    pid   | datid | datname | relid | direction | file | program | lines_processed | file_bytes_processed
>> ---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------
>>  3347126 | 16384 | yr      |     0 | TO        | t    | f       |         3529943 |             24906226
>> (1 row)
>>
>> "COPY test FROM '/tmp/ids';
>>
>> yr=# SELECT * from pg_stat_progress_copy;
>>    pid   | datid | datname | relid | direction | file | program | lines_processed | file_bytes_processed
>> ---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------
>>  3347126 | 16384 | yr      | 16385 | FROM      | t    | f       |       121591999 |            957218816
>> (1 row)
>>
>> Columns are inspired by CREATE INDEX progress report system view.
>>
>> pid - integer - PID of backend
>> datid - oid - OID of related database
>> datname - name - name of related database (this seems redundant, since oid should be enough, but it is the same in
CREATEINDEX) 
>> relid - oid - oid of table related to COPY command, when not known (for example when copying to file, it is 0)
>> direction - text - one of "FROM" or "TO" depends on command used
>> file - bool - is file is used?
>> program - bool - is program used?
>> lines_processed - bigint - amount of processed lines, works for both directions (FROM/TO)
>> file_bytes_processed - amount of bytes processed when file is used (otherwise 0), works for both direction (
>> FROM/TO) when file is used (file = t)
>>
>> Patch is attached and can be found also at https://github.com/simi/postgres/pull/5.
>>

Few comments:
@@ -713,6 +714,8 @@ CopyGetData(CopyState cstate, void *databuf, int
minread, int maxread)
  break;
  }

+ CopyUpdateBytesProgress(cstate, bytesread);
+
  return bytesread;
 }

This is actually the read data, actual processing will happen later
like in CopyReadLineText, it would be better if
CopyUpdateBytesProgress is done later, if not it will give the same
value even though it does multiple inserts on the table.
lines_processed will keep getting updated but file_bytes_processed
will not be updated.

 +pg_stat_progress_copy| SELECT s.pid,
+    s.datid,
+    d.datname,
+    s.relid,
+        CASE s.param1
+            WHEN 0 THEN 'TO'::text
+            WHEN 1 THEN 'FROM'::text
+            ELSE NULL::text
+        END AS direction,
+    ((s.param2)::integer)::boolean AS file,
+    ((s.param3)::integer)::boolean AS program,
+    s.param4 AS lines_processed,
+    s.param5 AS file_bytes_processed

You could include pg_size_pretty for s.param5 like
pg_size_pretty(S.param5) AS bytes_processed, it will be easier for
users to understand bytes_processed when the data size increases.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Initial progress reporting for COPY command

From

Josef Šimánek

Date:

22 June 2020, 08:21:13

po 22. 6. 2020 v 4:48 odesílatel Fujii Masao <masao.fujii@oss.nttdata.com> napsal:

On 2020/06/21 20:33, Josef Šimánek wrote:
>
>
> po 15. 6. 2020 v 6:39 odesílatel Fujii Masao <masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>> napsal:
>
>
>
> On 2020/06/14 21:32, Josef Šimánek wrote:
> > Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL maillist (https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer), I have prepared an initial patch for COPY command progress reporting.
>
> Sounds nice!
>
>
> > file - bool - is file is used?
> > program - bool - is program used?
>
> Are these fields really necessary in a progress view?
> What values are reported when STDOUT/STDIN is specified in COPY command?
>
>
> For STDOUT and STDIN file is true and program is false.

Could you tell me why these columns are necessary in *progress* view?
If we want to see what copy command is actually running, we can see
pg_stat_activity, instead. For example,

SELECT pc.*, a.query FROM pg_stat_progress_copy pc, pg_stat_activity a WHERE pc.pid = a.pid;

If that doesn't make any sense, I can remove those. I have not strong opinion about those values. Those were just around when I was looking for possible values to include in the progress report.

>
> > file_bytes_processed - amount of bytes processed when file is used (otherwise 0), works for both direction (
> > FROM/TO) when file is used (file = t)
>
> What value is reported when STDOUT/STDIN is specified in COPY command?
>
>
> For my first patch nothing was reported on STDOUT/STDIN usage. I'll attach new patch soon supporting those as well.

Thanks for the patch!

With the patch, pg_stat_progress_copy seems to report the progress of
the processing on file_fdw. Is this intentional?

Every action using internally COPY will be included in the progress report view.

I have spotted for example pg_dump does that and is reported there as well.

I do not see any problem regarding this. For pg_dump it is consistent with "pg_stat_activity" reporting COPY command in the query field.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [PATCH] Initial progress reporting for COPY command

From

Josef Šimánek

Date:

22 June 2020, 10:57:53

po 22. 6. 2020 v 9:15 odesílatel vignesh C <vignesh21@gmail.com> napsal:

On Sun, Jun 21, 2020 at 5:11 PM Josef Šimánek <josef.simanek@gmail.com> wrote:
>
> Thanks for all comments. I have updated code to support more options (including STDIN/STDOUT) and added some documentation.
>
> Patch is attached and can be found also at https://github.com/simi/postgres/pull/5.
>
> Diff version: https://github.com/simi/postgres/pull/5.diff
> Patch version: https://github.com/simi/postgres/pull/5.patch
>
> I'm also attaching screenshot of HTML documentation and html documentation file.
>
> I'll do my best to get this to commitfest now.
>
> ne 14. 6. 2020 v 14:32 odesílatel Josef Šimánek <josef.simanek@gmail.com> napsal:
>>
>> Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL maillist (https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer), I have prepared an initial patch for COPY command progress reporting.
>>
>> Few examples first:
>>
>> "COPY (SELECT * FROM test) TO '/tmp/ids';"
>>
>> yr=# SELECT * from pg_stat_progress_copy;
>> pid | datid | datname | relid | direction | file | program | lines_processed | file_bytes_processed
>> ---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------
>> 3347126 | 16384 | yr | 0 | TO | t | f | 3529943 | 24906226
>> (1 row)
>>
>> "COPY test FROM '/tmp/ids';
>>
>> yr=# SELECT * from pg_stat_progress_copy;
>> pid | datid | datname | relid | direction | file | program | lines_processed | file_bytes_processed
>> ---------+-------+---------+-------+-----------+------+---------+-----------------+----------------------
>> 3347126 | 16384 | yr | 16385 | FROM | t | f | 121591999 | 957218816
>> (1 row)
>>
>> Columns are inspired by CREATE INDEX progress report system view.
>>
>> pid - integer - PID of backend
>> datid - oid - OID of related database
>> datname - name - name of related database (this seems redundant, since oid should be enough, but it is the same in CREATE INDEX)
>> relid - oid - oid of table related to COPY command, when not known (for example when copying to file, it is 0)
>> direction - text - one of "FROM" or "TO" depends on command used
>> file - bool - is file is used?
>> program - bool - is program used?
>> lines_processed - bigint - amount of processed lines, works for both directions (FROM/TO)
>> file_bytes_processed - amount of bytes processed when file is used (otherwise 0), works for both direction (
>> FROM/TO) when file is used (file = t)
>>
>> Patch is attached and can be found also at https://github.com/simi/postgres/pull/5.
>>

Few comments:
@@ -713,6 +714,8 @@ CopyGetData(CopyState cstate, void *databuf, int
minread, int maxread)
break;
}

+ CopyUpdateBytesProgress(cstate, bytesread);
+
return bytesread;
}

This is actually the read data, actual processing will happen later
like in CopyReadLineText, it would be better if
CopyUpdateBytesProgress is done later, if not it will give the same
value even though it does multiple inserts on the table.
lines_processed will keep getting updated but file_bytes_processed
will not be updated.

First I would like to explain what's reported (or at least I'm trying to get reported) at bytes_processed column.

When exporting to file it should start at 0 and end up at the actual final file size.

When importing from file, it should do the same. You can check file size before you start COPY FROM and get actual progress looking at bytes_processed.

This column is just a counter of bytes read from input on COPY FROM or amount of bytes going through COPY TO.

Thanks for the hint regarding "CopyReadLineText". I'll take a look.

For now I have tested those cases:

CREATE TABLE test(id int);

INSERT INTO test SELECT 1 FROM generate_series(1, 1000000);
COPY (SELECT * FROM test) TO '/tmp/ids';

COPY test FROM '/tmp/ids';

psql -h /tmp yr -c 'COPY (SELECT 1 from generate_series(1,100000000)) TO STDOUT;' > /tmp/ryba.txt

echo /tmp/ryba.txt | psql -h /tmp yr -c 'COPY test FROM STDIN'

It is easy to check lines count and bytes count are in sync (since 1 line is 2 bytes here - "1" and newline character).

I'll try to check more complex COPY commands to ensure everything is in sync.

If you have any ideas for testing queries, feel free to suggest.

+pg_stat_progress_copy| SELECT s.pid,
+ s.datid,
+ d.datname,
+ s.relid,
+ CASE s.param1
+ WHEN 0 THEN 'TO'::text
+ WHEN 1 THEN 'FROM'::text
+ ELSE NULL::text
+ END AS direction,
+ ((s.param2)::integer)::boolean AS file,
+ ((s.param3)::integer)::boolean AS program,
+ s.param4 AS lines_processed,
+ s.param5 AS file_bytes_processed

You could include pg_size_pretty for s.param5 like
pg_size_pretty(S.param5) AS bytes_processed, it will be easier for
users to understand bytes_processed when the data size increases.

I was looking at the rest of reporting views and for me those seem to be just basic ones providing just raw data to be used later in custom nice friendly human-readable views built on the client side.

For example "pg_stat_progress_basebackup" also reports "backup_streamed" in raw form.

Anyway if you would like to make this view more user-friendly, I can add that. Just ping me.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Initial progress reporting for COPY command

From

Tomas Vondra

Date:

22 June 2020, 12:14:45

On Sun, Jun 21, 2020 at 01:40:34PM +0200, Josef Šimánek wrote:
>Thanks for all comments. I have updated code to support more options
>(including STDIN/STDOUT) and added some documentation.
>
>Patch is attached and can be found also at
>https://github.com/simi/postgres/pull/5.
>
>Diff version: https://github.com/simi/postgres/pull/5.diff
>Patch version: https://github.com/simi/postgres/pull/5.patch
>
>I'm also attaching screenshot of HTML documentation and html documentation
>file.
>
>I'll do my best to get this to commitfest now.
>

I see we're not showing the total number of bytes the COPY is expected
to process, which makes it hard to estimate how far we actually are.
Clearly there are cases when we really don't know that (exports, import
from stdin/program), but why not to show file size for imports from a
file? I'd expect that to be the most common case.

I wonder if it made sense to show some estimates in the other cases. For
example when exporting query result, maybe we could show the estimated
number of rows and size? Of course, that's prone to estimation errors
and it's more a wild idea for the future, I don't expect this patch to
implement that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PATCH] Initial progress reporting for COPY command

From

Josef Šimánek

Date:

22 June 2020, 13:33:00

po 22. 6. 2020 v 14:14 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com> napsal:

On Sun, Jun 21, 2020 at 01:40:34PM +0200, Josef Šimánek wrote:
>Thanks for all comments. I have updated code to support more options
>(including STDIN/STDOUT) and added some documentation.
>
>Patch is attached and can be found also at
>https://github.com/simi/postgres/pull/5.
>
>Diff version: https://github.com/simi/postgres/pull/5.diff
>Patch version: https://github.com/simi/postgres/pull/5.patch
>
>I'm also attaching screenshot of HTML documentation and html documentation
>file.
>
>I'll do my best to get this to commitfest now.
>

I see we're not showing the total number of bytes the COPY is expected
to process, which makes it hard to estimate how far we actually are.
Clearly there are cases when we really don't know that (exports, import
from stdin/program), but why not to show file size for imports from a
file? I'd expect that to be the most common case.

For COPY FROM file fstat is done and info is available already at https://github.com/postgres/postgres/blob/fe186b4c200b76a5c0f03379fe8645ed1c70a844/src/backend/commands/copy.c#L1934. It should be easy to update some param (param6 for example) with file size and expose it in report view. When not available, this column can be NULL.

Would that be enough?

On the other side everyone can check file size manually to get total value expected and just compare to reported bytes_processed. Alt. "wc -l" can be checked to get amount of lines and check lines_processed column to get progress. Should it check amount of lines and populate another column with lines total (using a configured separator) as well? AFAIK that would need full file scan which can be slow for huge files.

I wonder if it made sense to show some estimates in the other cases. For
example when exporting query result, maybe we could show the estimated
number of rows and size? Of course, that's prone to estimation errors
and it's more a wild idea for the future, I don't expect this patch to
implement that.

My plan here was to expose numbers not being currently available and let clients get the rest of info on their own.

For example:

- for "COPY (query) TO file" - EXPLAIN or COUNT variant of query could be executed before to get the amount of expected rows

- for "COPY table FROM file" - file size or amount of lines in file can be inspected first to get amount of expected rows or bytes to be processed

I see the current system view in my patch (and also all other report views currently available) more as a scaffold to build own tools.

For example CLI tools can use this to provide some kind of progress.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PATCH] Initial progress reporting for COPY command

From

Tomas Vondra

Date:

22 June 2020, 14:19:00

On Mon, Jun 22, 2020 at 03:33:00PM +0200, Josef Šimánek wrote:
>po 22. 6. 2020 v 14:14 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com>
>napsal:
>
>> On Sun, Jun 21, 2020 at 01:40:34PM +0200, Josef Šimánek wrote:
>> >Thanks for all comments. I have updated code to support more options
>> >(including STDIN/STDOUT) and added some documentation.
>> >
>> >Patch is attached and can be found also at
>> >https://github.com/simi/postgres/pull/5.
>> >
>> >Diff version: https://github.com/simi/postgres/pull/5.diff
>> >Patch version: https://github.com/simi/postgres/pull/5.patch
>> >
>> >I'm also attaching screenshot of HTML documentation and html documentation
>> >file.
>> >
>> >I'll do my best to get this to commitfest now.
>> >
>>
>> I see we're not showing the total number of bytes the COPY is expected
>> to process, which makes it hard to estimate how far we actually are.
>> Clearly there are cases when we really don't know that (exports, import
>> from stdin/program), but why not to show file size for imports from a
>> file? I'd expect that to be the most common case.
>>
>
>For COPY FROM file fstat is done and info is available already at
>https://github.com/postgres/postgres/blob/fe186b4c200b76a5c0f03379fe8645ed1c70a844/src/backend/commands/copy.c#L1934.
>It should be easy to update some param (param6 for example) with file size
>and expose it in report view. When not available, this column can be NULL.
>
>Would that be enough?
>

Yes, I think that'd be fine. The rows without a file should have NULL,
because we literally don't know what the value is. And 0 is a valid file
size, so we can't use it anyway.

>On the other side everyone can check file size manually to get total value
>expected and just compare to reported bytes_processed. Alt. "wc -l" can be
>checked to get amount of lines and check lines_processed column to get
>progress. Should it check amount of lines and populate another column with
>lines total (using a configured separator) as well? AFAIK that would need
>full file scan which can be slow for huge files.
>

Sure, but the extra `wc -l` is less convenient and you then need to
combine that with pg_stat_progress_copy. With the information right in
the view, you can do (100.0 * bytes_processed / bytes_total) and you get
the progress as a percentage. (I've omitted the NULL handling.)

As for the number of lines, I certainly don't think we need to scan the
file - that'd be far too expensive. What we might do is estimate it as

    total_bytes / (processed_bytes / processed_rows)

but that's something people can easily do on their own. So I don't think
it needs to be part of the patch, and IMHO bytes_processed / bytes_total
is a sufficient measure of progress.

>
>> I wonder if it made sense to show some estimates in the other cases. For
>> example when exporting query result, maybe we could show the estimated
>> number of rows and size? Of course, that's prone to estimation errors
>> and it's more a wild idea for the future, I don't expect this patch to
>> implement that.
>>
>
>My plan here was to expose numbers not being currently available and let
>clients get the rest of info on their own.
>
>For example:
>- for "COPY (query) TO file" - EXPLAIN or COUNT variant of query could be
>executed before to get the amount of expected rows
>- for "COPY table FROM file" - file size or amount of lines in file can be
>inspected first to get amount of expected rows or bytes to be processed
>
>I see the current system view in my patch (and also all other report views
>currently available) more as a scaffold to build own tools.
>
>For example CLI tools can use this to provide some kind of progress.
>

True, but I'd advise against putting this into v1 of the patch. Let's
keep it simple, get it committed and then maybe improve it later.

Some of these stats (like the estimates from a query) may be quite
unreliable, so I think it needs more discussion. We might invent
lines_estimated or something like that, for example.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PATCH] Initial progress reporting for COPY command

From

vignesh C

Date:

23 June 2020, 10:10:08

On Mon, Jun 22, 2020 at 4:28 PM Josef Šimánek <josef.simanek@gmail.com> wrote:
>
> Thanks for the hint regarding "CopyReadLineText". I'll take a look.
>
> For now I have tested those cases:
>
> CREATE TABLE test(id int);
> INSERT INTO test SELECT 1 FROM generate_series(1, 1000000);
> COPY (SELECT * FROM test) TO '/tmp/ids';
> COPY test FROM '/tmp/ids';
>
> psql -h /tmp yr -c 'COPY (SELECT 1 from generate_series(1,100000000)) TO STDOUT;' > /tmp/ryba.txt
> echo /tmp/ryba.txt | psql -h /tmp yr -c 'COPY test FROM STDIN'
>
> It is easy to check lines count and bytes count are in sync (since 1 line is 2 bytes here - "1" and newline
character).
> I'll try to check more complex COPY commands to ensure everything is in sync.
>
> If you have any ideas for testing queries, feel free to suggest.

For copy from statement you could attach the session, put a breakpoint
at CopyReadLineText, execution will hit this breakpoint for every
record it is doing COPY FROM and parallely check if
pg_stat_progress_copy is getting updated correctly. I noticed it was
showing the file read size instead of the actual processed bytes.

>>  +pg_stat_progress_copy| SELECT s.pid,
>> +    s.datid,
>> +    d.datname,
>> +    s.relid,
>> +        CASE s.param1
>> +            WHEN 0 THEN 'TO'::text
>> +            WHEN 1 THEN 'FROM'::text
>> +            ELSE NULL::text
>> +        END AS direction,
>> +    ((s.param2)::integer)::boolean AS file,
>> +    ((s.param3)::integer)::boolean AS program,
>> +    s.param4 AS lines_processed,
>> +    s.param5 AS file_bytes_processed
>>
>> You could include pg_size_pretty for s.param5 like
>> pg_size_pretty(S.param5) AS bytes_processed, it will be easier for
>> users to understand bytes_processed when the data size increases.
>
>
> I was looking at the rest of reporting views and for me those seem to be just basic ones providing just raw data to
beused later in custom nice friendly human-readable views built on the client side. 
> For example "pg_stat_progress_basebackup" also reports "backup_streamed" in raw form.
>
> Anyway if you would like to make this view more user-friendly, I can add that. Just ping me.

I felt we could add pg_size_pretty to make the view more user friendly.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Initial progress reporting for COPY command

From

Tomas Vondra

Date:

23 June 2020, 11:15:19

On Tue, Jun 23, 2020 at 03:40:08PM +0530, vignesh C wrote:
>On Mon, Jun 22, 2020 at 4:28 PM Josef Šimánek <josef.simanek@gmail.com> wrote:
>>
>> Thanks for the hint regarding "CopyReadLineText". I'll take a look.
>>
>> For now I have tested those cases:
>>
>> CREATE TABLE test(id int);
>> INSERT INTO test SELECT 1 FROM generate_series(1, 1000000);
>> COPY (SELECT * FROM test) TO '/tmp/ids';
>> COPY test FROM '/tmp/ids';
>>
>> psql -h /tmp yr -c 'COPY (SELECT 1 from generate_series(1,100000000)) TO STDOUT;' > /tmp/ryba.txt
>> echo /tmp/ryba.txt | psql -h /tmp yr -c 'COPY test FROM STDIN'
>>
>> It is easy to check lines count and bytes count are in sync (since 1 line is 2 bytes here - "1" and newline
character).
>> I'll try to check more complex COPY commands to ensure everything is in sync.
>>
>> If you have any ideas for testing queries, feel free to suggest.
>
>For copy from statement you could attach the session, put a breakpoint
>at CopyReadLineText, execution will hit this breakpoint for every
>record it is doing COPY FROM and parallely check if
>pg_stat_progress_copy is getting updated correctly. I noticed it was
>showing the file read size instead of the actual processed bytes.
>
>>>  +pg_stat_progress_copy| SELECT s.pid,
>>> +    s.datid,
>>> +    d.datname,
>>> +    s.relid,
>>> +        CASE s.param1
>>> +            WHEN 0 THEN 'TO'::text
>>> +            WHEN 1 THEN 'FROM'::text
>>> +            ELSE NULL::text
>>> +        END AS direction,
>>> +    ((s.param2)::integer)::boolean AS file,
>>> +    ((s.param3)::integer)::boolean AS program,
>>> +    s.param4 AS lines_processed,
>>> +    s.param5 AS file_bytes_processed
>>>
>>> You could include pg_size_pretty for s.param5 like
>>> pg_size_pretty(S.param5) AS bytes_processed, it will be easier for
>>> users to understand bytes_processed when the data size increases.
>>
>>
>> I was looking at the rest of reporting views and for me those seem to be just basic ones providing just raw data to
beused later in custom nice friendly human-readable views built on the client side.
 
>> For example "pg_stat_progress_basebackup" also reports "backup_streamed" in raw form.
>>
>> Anyway if you would like to make this view more user-friendly, I can add that. Just ping me.
>
>I felt we could add pg_size_pretty to make the view more user friendly.
>

Please no. That'd make processing of the data (say, computing progress
as processed/total) impossible. It's easy to add pg_size_pretty if you
want it, it's impossible to undo it. I don't see a single pg_size_pretty
call in system_views.sql.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PATCH] Initial progress reporting for COPY command

From

Pavel Stehule

Date:

23 June 2020, 12:02:45

út 23. 6. 2020 v 13:15 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com> napsal:

On Tue, Jun 23, 2020 at 03:40:08PM +0530, vignesh C wrote:
>On Mon, Jun 22, 2020 at 4:28 PM Josef Šimánek <josef.simanek@gmail.com> wrote:
>>
>> Thanks for the hint regarding "CopyReadLineText". I'll take a look.
>>
>> For now I have tested those cases:
>>
>> CREATE TABLE test(id int);
>> INSERT INTO test SELECT 1 FROM generate_series(1, 1000000);
>> COPY (SELECT * FROM test) TO '/tmp/ids';
>> COPY test FROM '/tmp/ids';
>>
>> psql -h /tmp yr -c 'COPY (SELECT 1 from generate_series(1,100000000)) TO STDOUT;' > /tmp/ryba.txt
>> echo /tmp/ryba.txt | psql -h /tmp yr -c 'COPY test FROM STDIN'
>>
>> It is easy to check lines count and bytes count are in sync (since 1 line is 2 bytes here - "1" and newline character).
>> I'll try to check more complex COPY commands to ensure everything is in sync.
>>
>> If you have any ideas for testing queries, feel free to suggest.
>
>For copy from statement you could attach the session, put a breakpoint
>at CopyReadLineText, execution will hit this breakpoint for every
>record it is doing COPY FROM and parallely check if
>pg_stat_progress_copy is getting updated correctly. I noticed it was
>showing the file read size instead of the actual processed bytes.
>
>>> +pg_stat_progress_copy| SELECT s.pid,
>>> + s.datid,
>>> + d.datname,
>>> + s.relid,
>>> + CASE s.param1
>>> + WHEN 0 THEN 'TO'::text
>>> + WHEN 1 THEN 'FROM'::text
>>> + ELSE NULL::text
>>> + END AS direction,
>>> + ((s.param2)::integer)::boolean AS file,
>>> + ((s.param3)::integer)::boolean AS program,
>>> + s.param4 AS lines_processed,
>>> + s.param5 AS file_bytes_processed
>>>
>>> You could include pg_size_pretty for s.param5 like
>>> pg_size_pretty(S.param5) AS bytes_processed, it will be easier for
>>> users to understand bytes_processed when the data size increases.
>>
>>
>> I was looking at the rest of reporting views and for me those seem to be just basic ones providing just raw data to be used later in custom nice friendly human-readable views built on the client side.
>> For example "pg_stat_progress_basebackup" also reports "backup_streamed" in raw form.
>>
>> Anyway if you would like to make this view more user-friendly, I can add that. Just ping me.
>
>I felt we could add pg_size_pretty to make the view more user friendly.
>

Please no. That'd make processing of the data (say, computing progress
as processed/total) impossible. It's easy to add pg_size_pretty if you
want it, it's impossible to undo it. I don't see a single pg_size_pretty
call in system_views.sql.

+1, *_pretty functions should be used on the client side only. Server side (source) should be in raw format.

Regards

Pavel

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PATCH] Initial progress reporting for COPY command

From

Josef Šimánek

Date:

23 June 2020, 12:22:07

út 23. 6. 2020 v 13:15 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com> napsal:

On Tue, Jun 23, 2020 at 03:40:08PM +0530, vignesh C wrote:
>On Mon, Jun 22, 2020 at 4:28 PM Josef Šimánek <josef.simanek@gmail.com> wrote:
>>
>> Thanks for the hint regarding "CopyReadLineText". I'll take a look.
>>
>> For now I have tested those cases:
>>
>> CREATE TABLE test(id int);
>> INSERT INTO test SELECT 1 FROM generate_series(1, 1000000);
>> COPY (SELECT * FROM test) TO '/tmp/ids';
>> COPY test FROM '/tmp/ids';
>>
>> psql -h /tmp yr -c 'COPY (SELECT 1 from generate_series(1,100000000)) TO STDOUT;' > /tmp/ryba.txt
>> echo /tmp/ryba.txt | psql -h /tmp yr -c 'COPY test FROM STDIN'
>>
>> It is easy to check lines count and bytes count are in sync (since 1 line is 2 bytes here - "1" and newline character).
>> I'll try to check more complex COPY commands to ensure everything is in sync.
>>
>> If you have any ideas for testing queries, feel free to suggest.
>
>For copy from statement you could attach the session, put a breakpoint
>at CopyReadLineText, execution will hit this breakpoint for every
>record it is doing COPY FROM and parallely check if
>pg_stat_progress_copy is getting updated correctly. I noticed it was
>showing the file read size instead of the actual processed bytes.
>
>>> +pg_stat_progress_copy| SELECT s.pid,
>>> + s.datid,
>>> + d.datname,
>>> + s.relid,
>>> + CASE s.param1
>>> + WHEN 0 THEN 'TO'::text
>>> + WHEN 1 THEN 'FROM'::text
>>> + ELSE NULL::text
>>> + END AS direction,
>>> + ((s.param2)::integer)::boolean AS file,
>>> + ((s.param3)::integer)::boolean AS program,
>>> + s.param4 AS lines_processed,
>>> + s.param5 AS file_bytes_processed
>>>
>>> You could include pg_size_pretty for s.param5 like
>>> pg_size_pretty(S.param5) AS bytes_processed, it will be easier for
>>> users to understand bytes_processed when the data size increases.
>>
>>
>> I was looking at the rest of reporting views and for me those seem to be just basic ones providing just raw data to be used later in custom nice friendly human-readable views built on the client side.
>> For example "pg_stat_progress_basebackup" also reports "backup_streamed" in raw form.
>>
>> Anyway if you would like to make this view more user-friendly, I can add that. Just ping me.
>
>I felt we could add pg_size_pretty to make the view more user friendly.
>

Please no. That'd make processing of the data (say, computing progress
as processed/total) impossible. It's easy to add pg_size_pretty if you
want it, it's impossible to undo it. I don't see a single pg_size_pretty
call in system_views.sql.

I think the same.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PATCH] Initial progress reporting for COPY command

From

Bharath Rupireddy

Date:

23 June 2020, 12:52:01

> po 15. 6. 2020 v 7:34 odesílatel Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> napsal:
>>
>> > I'm using ftell to get current position in file to populate file_bytes_processed without error handling (ftell can
return-1L and also populate errno on problems). 
>> >
>> > 1. Is that a good way to get progress of file processing?
>>
>> IMO, it's better to handle the error cases. One possible case where
>> ftell can return -1 and set errno is when the total bytes processed is
>> more than LONG_MAX.
>>
>> Will your patch handle file_bytes_processed reporting for COPY FROM
>> STDIN cases? For this case, ftell can't be used.
>>
>> Instead of using ftell and worrying about the errors, a simple
>> approach could be to have a uint64 variable in CopyStateData to track
>> the number of bytes read whenever CopyGetData is called. This approach
>> can also handle the case of COPY FROM STDIN.
>
>
> Thanks for suggestion. I used this approach and latest patch supports both STDIN and STDOUT now.
>

Thanks.

It would be good to see the performance of the copy command(probably
with a few GBs of data) with patch and without patch for both csv/text
and binary files.

For copy from command CopyGetData gets called for every
RAW_BUF_SIZE(64KB) and so is CopyUpdateBytesProgress function, but for
binary format files, CopyGetData gets called for each field/column for
all rows/lines/tuples.

Can we make CopyUpdateBytesProgress() a macro or an inline
function(probably by using pg_attribute_always_inline) to reduce
function call overhead as it just handles two statements?

I tried to apply the patch on commit #
7ce461560159948ba0c802c767e42c5f5ae08b4a, seems like a warning.

bharath:postgres$ git apply /mnt/hgfs/Downloads/copy-progress-v2.diff
/mnt/hgfs/Downloads/copy-progress-v2.diff:277: trailing whitespace.
                         * for counting tuples inserted by an INSERT
command. Update
warning: 1 line adds whitespace errors.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Initial progress reporting for COPY command

From

Fujii Masao

Date:

23 June 2020, 17:57:31

On 2020/06/22 17:21, Josef Šimánek wrote:
> 
> 
> po 22. 6. 2020 v 4:48 odesílatel Fujii Masao <masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>>
napsal:
> 
> 
> 
>     On 2020/06/21 20:33, Josef Šimánek wrote:
>      >
>      >
>      > po 15. 6. 2020 v 6:39 odesílatel Fujii Masao <masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>
<mailto:masao.fujii@oss.nttdata.com<mailto:masao.fujii@oss.nttdata.com>>> napsal:
 
>      >
>      >
>      >
>      >     On 2020/06/14 21:32, Josef Šimánek wrote:
>      >      > Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL maillist
(https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer),
Ihave prepared an initial patch for COPY command progress reporting.
 
>      >
>      >     Sounds nice!
>      >
>      >
>      >      > file - bool - is file is used?
>      >      > program - bool - is program used?
>      >
>      >     Are these fields really necessary in a progress view?
>      >     What values are reported when STDOUT/STDIN is specified in COPY command?
>      >
>      >
>      > For STDOUT and STDIN file is true and program is false.
> 
>     Could you tell me why these columns are necessary in *progress* view?
>     If we want to see what copy command is actually running, we can see
>     pg_stat_activity, instead. For example,
> 
>           SELECT pc.*, a.query FROM pg_stat_progress_copy pc, pg_stat_activity a WHERE pc.pid = a.pid;
> 
> If that doesn't make any sense, I can remove those. I have not strong opinion about those values. Those were just
aroundwhen I was looking for possible values to include in the progress report.
 

I vote not to expose them. *If* we expose them, we should also
expose the options in pg_stat_progress_xxx views, for example,
the options for BASE_BACKUP command in pg_stat_progress_basebackup,
for the consistency. But I don't think that makes sense.

> 
>      >
>      >      > file_bytes_processed - amount of bytes processed when file is used (otherwise 0), works for both
direction(
 
>      >      > FROM/TO) when file is used (file = t)
>      >
>      >     What value is reported when STDOUT/STDIN is specified in COPY command?
>      >
>      >
>      > For my first patch nothing was reported on STDOUT/STDIN usage. I'll attach new patch soon supporting those as
well.
> 
>     Thanks for the patch!
> 
>     With the patch, pg_stat_progress_copy seems to report the progress of
>     the processing on file_fdw. Is this intentional?
> 
> 
> Every action using internally COPY will be included in the progress report view.
> I have spotted for example pg_dump does that and is reported there as well.
> I do not see any problem regarding this. For pg_dump it is consistent with "pg_stat_activity" reporting COPY command
inthe query field.
 

So it's better to add this kind of information into the docs?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [PATCH] Initial progress reporting for COPY command

From

Fujii Masao

Date:

23 June 2020, 18:17:13

On 2020/06/22 21:14, Tomas Vondra wrote:
> On Sun, Jun 21, 2020 at 01:40:34PM +0200, Josef Šimánek wrote:
>> Thanks for all comments. I have updated code to support more options
>> (including STDIN/STDOUT) and added some documentation.
>>
>> Patch is attached and can be found also at
>> https://github.com/simi/postgres/pull/5.
>>
>> Diff version: https://github.com/simi/postgres/pull/5.diff
>> Patch version: https://github.com/simi/postgres/pull/5.patch
>>
>> I'm also attaching screenshot of HTML documentation and html documentation
>> file.
>>
>> I'll do my best to get this to commitfest now.
>>
> 
> I see we're not showing the total number of bytes the COPY is expected
> to process, which makes it hard to estimate how far we actually are.
> Clearly there are cases when we really don't know that (exports, import
> from stdin/program), but why not to show file size for imports from a
> file? I'd expect that to be the most common case.

+1

I like using \copy psql meta command. So I feel better if the total size
is reported even when using \copy (i.e., COPY STDIN).

As just idea, what about adding new option into COPY command,
allowing users (including \copy command) to specify the estimated size
of input file in that option, and making pg_stat_progress_copy view
display it as the total size? If we implement this mechanism, we can
change \copy command so that it calculate the actual size of input file
and specify it in that option.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [PATCH] Initial progress reporting for COPY command

From

vignesh C

Date:

25 June 2020, 01:05:22

On Tue, Jun 23, 2020 at 4:45 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> >>
> >> Anyway if you would like to make this view more user-friendly, I can add that. Just ping me.
> >
> >I felt we could add pg_size_pretty to make the view more user friendly.
> >
>
> Please no. That'd make processing of the data (say, computing progress
> as processed/total) impossible. It's easy to add pg_size_pretty if you
> want it, it's impossible to undo it. I don't see a single pg_size_pretty
> call in system_views.sql.
>

I thought of including pg_size_pretty as we there was no total_bytes
to compare with, but I'm ok without it too as there is an option for
user to always include it in the client side like "SELECT
pg_size_pretty(file_bytes_processed) from pg_stat_progress_copy;" if
required.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Initial progress reporting for COPY command

From

Daniel Gustafsson

Date:

02 July 2020, 12:25:16

The automated patchtester for the Commitfest gets confused when there are two
versions of the same changeset attached to the email, as it tries to apply them
both which obviously results in an application failure.  I've attached just the
previously submitted patch version to this email to see if we can get a test
run of it.

cheers ./daniel

Attachment

copy-progress-v2.patch

Re: [PATCH] Initial progress reporting for COPY command

From

Josef Šimánek

Date:

02 July 2020, 12:42:16

čt 2. 7. 2020 v 14:25 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:

The automated patchtester for the Commitfest gets confused when there are two
versions of the same changeset attached to the email, as it tries to apply them
both which obviously results in an application failure. I've attached just the
previously submitted patch version to this email to see if we can get a test
run of it.

Thanks, I'm new to commitfest and I was confused as well. I tried to reattach the thread there. I'll prepare a new patch soon, what should I do? Just attach it again?

cheers ./daniel

Re: [PATCH] Initial progress reporting for COPY command

From

Daniel Gustafsson

Date:

02 July 2020, 12:51:40

> On 2 Jul 2020, at 14:42, Josef Šimánek <josef.simanek@gmail.com> wrote:
> čt 2. 7. 2020 v 14:25 odesílatel Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> napsal:

> The automated patchtester for the Commitfest gets confused when there are two
> versions of the same changeset attached to the email, as it tries to apply them
> both which obviously results in an application failure.  I've attached just the
> previously submitted patch version to this email to see if we can get a test
> run of it.
>
> Thanks, I'm new to commitfest and I was confused as well.

No worries, we're here to help.

> I tried to reattach the thread there. I'll prepare a new patch soon, what should I do? Just attach it again?

Correct, just reply to the thread with a new version of the patch attached, and
it'll get picked up automatically. No need to do anything more.

cheers ./daniel

Re: [PATCH] Initial progress reporting for COPY command

From

Fujii Masao

Date:

28 July 2020, 17:00:04

On 2020/07/02 21:51, Daniel Gustafsson wrote:
>> On 2 Jul 2020, at 14:42, Josef Šimánek <josef.simanek@gmail.com> wrote:
>> čt 2. 7. 2020 v 14:25 odesílatel Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> napsal:
> 
>> The automated patchtester for the Commitfest gets confused when there are two
>> versions of the same changeset attached to the email, as it tries to apply them
>> both which obviously results in an application failure.  I've attached just the
>> previously submitted patch version to this email to see if we can get a test
>> run of it.
>>
>> Thanks, I'm new to commitfest and I was confused as well.
> 
> No worries, we're here to help.
> 
>> I tried to reattach the thread there. I'll prepare a new patch soon, what should I do? Just attach it again?

New patch has not been sent yet.
So I marked this as "Waiting on Author" at Commit Fest.

Regards,


-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [PATCH] Initial progress reporting for COPY command

From

Josef Šimánek

Date:

28 July 2020, 18:24:57

Thanks for the info. I am waiting for review. Is there any summary of requested changes needed?

Dne út 28. 7. 2020 19:00 uživatel Fujii Masao <masao.fujii@oss.nttdata.com> napsal:

On 2020/07/02 21:51, Daniel Gustafsson wrote:
>> On 2 Jul 2020, at 14:42, Josef Šimánek <josef.simanek@gmail.com> wrote:
>> čt 2. 7. 2020 v 14:25 odesílatel Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> napsal:
>
>> The automated patchtester for the Commitfest gets confused when there are two
>> versions of the same changeset attached to the email, as it tries to apply them
>> both which obviously results in an application failure. I've attached just the
>> previously submitted patch version to this email to see if we can get a test
>> run of it.
>>
>> Thanks, I'm new to commitfest and I was confused as well.
>
> No worries, we're here to help.
>
>> I tried to reattach the thread there. I'll prepare a new patch soon, what should I do? Just attach it again?

New patch has not been sent yet.
So I marked this as "Waiting on Author" at Commit Fest.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [PATCH] Initial progress reporting for COPY command

From

Pavel Stehule

Date:

28 July 2020, 18:30:22

út 28. 7. 2020 v 20:25 odesílatel Josef Šimánek <josef.simanek@gmail.com> napsal:

Thanks for the info. I am waiting for review. Is there any summary of requested changes needed?

Maybe it is just noise - you wrote so you will resend a patch to different thread

>
>> I tried to reattach the thread there. I'll prepare a new patch soon, what should I do? Just attach it again?

Regards

Pavel

Dne út 28. 7. 2020 19:00 uživatel Fujii Masao <masao.fujii@oss.nttdata.com> napsal:

On 2020/07/02 21:51, Daniel Gustafsson wrote:
>> On 2 Jul 2020, at 14:42, Josef Šimánek <josef.simanek@gmail.com> wrote:
>> čt 2. 7. 2020 v 14:25 odesílatel Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> napsal:
>
>> The automated patchtester for the Commitfest gets confused when there are two
>> versions of the same changeset attached to the email, as it tries to apply them
>> both which obviously results in an application failure. I've attached just the
>> previously submitted patch version to this email to see if we can get a test
>> run of it.
>>
>> Thanks, I'm new to commitfest and I was confused as well.
>
> No worries, we're here to help.
>
>> I tried to reattach the thread there. I'll prepare a new patch soon, what should I do? Just attach it again?

New patch has not been sent yet.
So I marked this as "Waiting on Author" at Commit Fest.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [PATCH] Initial progress reporting for COPY command

From

Fujii Masao

Date:

29 July 2020, 23:51:36

On 2020/07/29 3:30, Pavel Stehule wrote:
> 
> 
> út 28. 7. 2020 v 20:25 odesílatel Josef Šimánek <josef.simanek@gmail.com <mailto:josef.simanek@gmail.com>> napsal:
> 
>     Thanks for the info. I am waiting for review. Is there any summary of requested changes needed?
> 
> 
> Maybe it is just noise - you wrote so you will resend a patch to different thread
> 
>> 
>>> I tried to reattach the thread there. I'll prepare a new patch soon, what should I do? Just attach it again?

Yeah, since I read this message, I was thinking that new patch will be
posted. But, Josef, if the situation was changed, could you correct me?
Anyway the patch seems not to be applied cleanly, so at least it needs to
be updated to address that.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [PATCH] Initial progress reporting for COPY command

From

Michael Paquier

Date:

07 September 2020, 04:13:10

On Thu, Jul 30, 2020 at 08:51:36AM +0900, Fujii Masao wrote:
> Yeah, since I read this message, I was thinking that new patch will be
> posted. But, Josef, if the situation was changed, could you correct me?
> Anyway the patch seems not to be applied cleanly, so at least it needs to
> be updated to address that.

Josef, the patch has been waiting on author for a bit more than one
month, so could you send a rebased version please?  It looks that you
are still a bit confused by the commit fest process, and as a first
step we need a clean version to be able to review it.  This would also
allow the commit fest bot to check it at http://commitfest.cputube.org/.
--
Michael

Attachment

signature.asc

Re: [PATCH] Initial progress reporting for COPY command

From

Michael Paquier

Date:

17 September 2020, 05:09:20

On Mon, Sep 07, 2020 at 01:13:10PM +0900, Michael Paquier wrote:
> Josef, the patch has been waiting on author for a bit more than one
> month, so could you send a rebased version please?  It looks that you
> are still a bit confused by the commit fest process, and as a first
> step we need a clean version to be able to review it.  This would also
> allow the commit fest bot to check it at http://commitfest.cputube.org/.

This feature has some appeal, but there is no activity lately, so I am
marking it as returned with feedback.  Please feel free to send a new
patch once you have time to do so, and I would recommend to register a
new entry in the commit fest app when done.
--
Michael

Attachment

signature.asc

Re: [PATCH] Initial progress reporting for COPY command

From

Josef Šimánek

Date:

17 September 2020, 15:08:46

čt 17. 9. 2020 v 7:09 odesílatel Michael Paquier <michael@paquier.xyz> napsal:
>
> On Mon, Sep 07, 2020 at 01:13:10PM +0900, Michael Paquier wrote:
> > Josef, the patch has been waiting on author for a bit more than one
> > month, so could you send a rebased version please?  It looks that you
> > are still a bit confused by the commit fest process, and as a first
> > step we need a clean version to be able to review it.  This would also
> > allow the commit fest bot to check it at http://commitfest.cputube.org/.
>
> This feature has some appeal, but there is no activity lately, so I am
> marking it as returned with feedback.  Please feel free to send a new
> patch once you have time to do so, and I would recommend to register a
> new entry in the commit fest app when done.

Thanks for info. I hope I'll be able to revisit this patch soon,
rebase and post again. I'm still interested in this.

> --
> Michael