Thread: [PATCH] Simple progress reporting for COPY command

[PATCH] Simple progress reporting for COPY command

From

Josef Šimánek

Date:

01 January 2021, 01:25:31

Hello,

finally I had some time to revisit patch and all comments from
https://www.postgresql.org/message-id/CAFp7QwqMGEi4OyyaLEK9DR0%2BE%2BoK3UtA4bEjDVCa4bNkwUY2PQ%40mail.gmail.com
and I have prepared simple version of COPY command progress reporting.

To keep the patch small as possible, I have introduced only a minimum
set of columns. It could be extended later if needed.

Columns are inspired by CREATE INDEX progress report system view.

pid - integer - PID of backend
datid - oid - OID of related database
datname - name - name of related database (this seems redundant, since
oid should be enough, but it is the same in CREATE INDEX)
relid - oid - oid of table related to COPY command, when not known
(for example when copying to file, it is 0)
bytes_processed - bigint - amount of bytes processed
bytes_total - bigint - file size in bytes if COPY FROM file (0 if not
COPY FROM file)
lines_processed - bigint - amount of tuples processed

example output of progress for common use case (import from CSV file):

first console:
yr=# COPY test FROM '/home/retro/test.csv' (FORMAT CSV);

second console:
yr=# SELECT * FROM pg_stat_progress_copy;
  pid   | datid | datname | relid | bytes_processed | bytes_total |
lines_processed
--------+-------+---------+-------+-----------------+-------------+-----------------
 803148 | 16384 | yr      | 16394 |       998965248 |  1777777796 |
    56730126
(1 row)

It is simple to get progress in percents for example by:

yr=# SELECT (bytes_processed/bytes_total::decimal)*100 FROM
pg_stat_progress_copy WHERE pid = 803148;
        ?column?
-------------------------
 50.04287948706048525800

^ ~50% of file processed already

I did some dead simple benchmarking as well. The difference is not
huge. Each command works with 100 millions of tuples. Times are in
seconds.

           test             with progress   master (32d6287)   difference
 ------------------------- --------------- ------------------ ------------
  COPY table TO                    46.102             47.499       -1.397
  COPY query TO                    52.168             49.822        2.346
  COPY table TO PROGRAM            52.345             51.882        0.463
  COPY query TO PROGRAM            54.141             52.763        1.378
  COPY table FROM                  88.970             85.161        3.809
  COPY table FROM PROGRAM          94.393             90.346        4.047

Properly formatted table (since I'm not sure everyone here would be
able to see the table formatted well) and the benchmark source is
present at https://github.com/simi/postgres/pull/6. I have also
included an example output in there.

I'll add this to the current commitfest as well.

Attachment

001-copy-progress.patch

Re: [PATCH] Simple progress reporting for COPY command

From

Bharath Rupireddy

Date:

01 January 2021, 10:15:50

On Fri, Jan 1, 2021 at 6:55 AM Josef Šimánek <josef.simanek@gmail.com> wrote:
> finally I had some time to revisit patch and all comments from
> https://www.postgresql.org/message-id/CAFp7QwqMGEi4OyyaLEK9DR0%2BE%2BoK3UtA4bEjDVCa4bNkwUY2PQ%40mail.gmail.com
> and I have prepared simple version of COPY command progress reporting.

Thanks for the patch.

> To keep the patch small as possible, I have introduced only a minimum
> set of columns. It could be extended later if needed.
>
> Columns are inspired by CREATE INDEX progress report system view.
>
> pid - integer - PID of backend
> datid - oid - OID of related database
> datname - name - name of related database (this seems redundant, since
> oid should be enough, but it is the same in CREATE INDEX)
> relid - oid - oid of table related to COPY command, when not known
> (for example when copying to file, it is 0)
> bytes_processed - bigint - amount of bytes processed
> bytes_total - bigint - file size in bytes if COPY FROM file (0 if not
> COPY FROM file)

It means that, for COPY FROM STDIN, bytes_total will be 0. IIUC, so
having this parameter as 0 is an indication of the COPY command being
from STDIN? I'm not sure whether it's discussed or not previously, but
I personally prefer to have it as a separate column, say a varchar or
text column with values STDIN, FILE or PROGRAM may be. And also it
will be better if we have the format of the data streaming in, as a
separate column, with possible values may be CSV, TEXT, BINARY.

Since this progress reporting is for both COPY FROM and COPY TO,
wouldn't it be good to have that also as a separate column, say
command type with values FROM and TO?

Thoughts?

I'm sure some of the points would have already been discussed. I just
shared my thoughts here.

I have not looked into the patch in detail, but it's missing tests.
I'm sure we can not add the tests into copy.sql or copy2.sql, can we
think of adding test cases to TAP or under isolation? I'm not sure
whether other progress reporting commands have test cases.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Simple progress reporting for COPY command

From

Josef Šimánek

Date:

01 January 2021, 13:00:53

pá 1. 1. 2021 v 11:16 odesílatel Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> napsal:
>
> On Fri, Jan 1, 2021 at 6:55 AM Josef Šimánek <josef.simanek@gmail.com> wrote:
> > finally I had some time to revisit patch and all comments from
> > https://www.postgresql.org/message-id/CAFp7QwqMGEi4OyyaLEK9DR0%2BE%2BoK3UtA4bEjDVCa4bNkwUY2PQ%40mail.gmail.com
> > and I have prepared simple version of COPY command progress reporting.
>
> Thanks for the patch.
>
> > To keep the patch small as possible, I have introduced only a minimum
> > set of columns. It could be extended later if needed.
> >
> > Columns are inspired by CREATE INDEX progress report system view.
> >
> > pid - integer - PID of backend
> > datid - oid - OID of related database
> > datname - name - name of related database (this seems redundant, since
> > oid should be enough, but it is the same in CREATE INDEX)
> > relid - oid - oid of table related to COPY command, when not known
> > (for example when copying to file, it is 0)
> > bytes_processed - bigint - amount of bytes processed
> > bytes_total - bigint - file size in bytes if COPY FROM file (0 if not
> > COPY FROM file)
>
> It means that, for COPY FROM STDIN, bytes_total will be 0. IIUC, so
> having this parameter as 0 is an indication of the COPY command being
> from STDIN? I'm not sure whether it's discussed or not previously, but
> I personally prefer to have it as a separate column, say a varchar or
> text column with values STDIN, FILE or PROGRAM may be. And also it
> will be better if we have the format of the data streaming in, as a
> separate column, with possible values may be CSV, TEXT, BINARY.
>
> Since this progress reporting is for both COPY FROM and COPY TO,
> wouldn't it be good to have that also as a separate column, say
> command type with values FROM and TO?
>
> Thoughts?

My first patch had more columns (direction - FROM/TO, file bool,
program bool), but it was subject of discussion what's the purpose.
You can check the query by looking at pg_stat_activity by pid to get
more info. To keep it simple I decided to go with a minimal set of
columns for now. It can be extended later. I'm ok to continue on
improving this with more feedback once merged.

> I'm sure some of the points would have already been discussed. I just
> shared my thoughts here.
>
> I have not looked into the patch in detail, but it's missing tests.
> I'm sure we can not add the tests into copy.sql or copy2.sql, can we
> think of adding test cases to TAP or under isolation? I'm not sure
> whether other progress reporting commands have test cases.

I have raised the question in an old thread as well since there are no
tests for other progress commands as well. I was told it is ok for now
to keep it untested, since there's no easy way to do that for now.

https://www.postgresql.org/message-id/20200615001828.GA52676%40paquier.xyz

> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Simple progress reporting for COPY command

From

Tomas Vondra

Date:

04 January 2021, 23:45:57

Hi,

I did take a quick look today, and I have a couple minor comments:

1) The catalog sgml docs seem to mention bytes_processed twice (one of 
that should be bytes_total), and line_processed (should be "lines_").


2) I'm not quite sure about not including any info about the command. 
For example pg_stat_progress_create_index includes info about the 
command, although it's not very detailed. Not sure how useful would it 
be to show just COPY FROM / COPY TO, without more details.

It's probably possible to extract this from pg_stat_activity, but that 
may be rather cumbersome (parsing arbitrary SQL and all that). Also, 
what if the COPY is called from a function etc.?


3) I wonder if we should have something like lines_estimated. For COPY 
TO it's pretty simple to calculate it as

     bytes_total / (bytes_processed / lines_processed)

but what about

     COPY (query) TO file

In that case we don't know the total amount of bytes / rows, so we can't 
calculate any estimate. So maybe we could peek into the query plan? But 
I agree this is something we can add later.


4) This comment is a bit confusing, as it mixes "total" and "processed". 
I'd just say "number of bytes processed so far" instead.



Other than that, it seems fine. I'm sure we could add more features, but 
it seems like a good start - I plan to get this committed once I get a 
patch fixing the docs issues.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [PATCH] Simple progress reporting for COPY command

From

Josef Šimánek

Date:

05 January 2021, 01:32:55

út 5. 1. 2021 v 0:46 odesílatel Tomas Vondra
<tomas.vondra@enterprisedb.com> napsal:
>
> Hi,
>
> I did take a quick look today, and I have a couple minor comments:

Hi! Thanks for your time.

> 1) The catalog sgml docs seem to mention bytes_processed twice (one of
> that should be bytes_total), and line_processed (should be "lines_").

Fixed in attached patch.

>
> 2) I'm not quite sure about not including any info about the command.
> For example pg_stat_progress_create_index includes info about the
> command, although it's not very detailed. Not sure how useful would it
> be to show just COPY FROM / COPY TO, without more details.
>
> It's probably possible to extract this from pg_stat_activity, but that
> may be rather cumbersome (parsing arbitrary SQL and all that). Also,
> what if the COPY is called from a function etc.?

Any idea where to discuss this? My usecase is really simple. I would
like to be able to see progress of COPY command by pid. There is a lot
of COPY info inside CopyToStateData and CopyFromStateData structs. The
only limitation I see is support of only int64 values for progress
reporting columns. I'm not sure if it is easily possible to expose for
example filename this way.

>
> 3) I wonder if we should have something like lines_estimated. For COPY
> TO it's pretty simple to calculate it as
>
>      bytes_total / (bytes_processed / lines_processed)
>
> but what about
>
>      COPY (query) TO file
>
> In that case we don't know the total amount of bytes / rows, so we can't
> calculate any estimate. So maybe we could peek into the query plan? But
> I agree this is something we can add later.

If I remember well one of the original ideas was to be able to pass
custom bytes_total (or lines_total) by COPY options to be stored in
copy progress. I can add this in some following patch if still
welcomed.

For estimates I would prefer to add an additional column to not mix
those two together (or at least boolean estimated = true/false and
reuse bytes_total column). If query estimates are welcomed, I can take
a look at how to reach the query plan and expose those numbers when
the query is used to estimated_lines and potentially estimated_bytes
columns. It would be probably a little tricky to calculate
estimated_bytes for some column types.

Also currently only COPY FROM file supports bytes_total (it is 0 for
all other scenarios). I think it should be possible for PostgreSQL to
know the actual amount of lines query returns for some kind of
queries, but I have no idea where to look at this. If that's possible
to get, it would be one of the next steps to introduce additional
column lines_total.

>
> 4) This comment is a bit confusing, as it mixes "total" and "processed".
> I'd just say "number of bytes processed so far" instead.
>
Fixed in attached patch.
>
> Other than that, it seems fine. I'm sure we could add more features, but
> it seems like a good start - I plan to get this committed once I get a
> patch fixing the docs issues.

Patch is attached, it should be applied to the top of the previous
patch. Overall patch (having both patches merged together) could be
found at https://patch-diff.githubusercontent.com/raw/simi/postgres/pull/6.patch.

> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Attachment

002-copy-progress.patch

Re: [PATCH] Simple progress reporting for COPY command

From

Josef Šimánek

Date:

05 January 2021, 10:02:25

I'm attaching the whole patch since commitfest failed to ingest the
last incremental on CI.


út 5. 1. 2021 v 2:32 odesílatel Josef Šimánek <josef.simanek@gmail.com> napsal:
>
> út 5. 1. 2021 v 0:46 odesílatel Tomas Vondra
> <tomas.vondra@enterprisedb.com> napsal:
> >
> > Hi,
> >
> > I did take a quick look today, and I have a couple minor comments:
>
> Hi! Thanks for your time.
>
> > 1) The catalog sgml docs seem to mention bytes_processed twice (one of
> > that should be bytes_total), and line_processed (should be "lines_").
>
> Fixed in attached patch.
>
> >
> > 2) I'm not quite sure about not including any info about the command.
> > For example pg_stat_progress_create_index includes info about the
> > command, although it's not very detailed. Not sure how useful would it
> > be to show just COPY FROM / COPY TO, without more details.
> >
> > It's probably possible to extract this from pg_stat_activity, but that
> > may be rather cumbersome (parsing arbitrary SQL and all that). Also,
> > what if the COPY is called from a function etc.?
>
> Any idea where to discuss this? My usecase is really simple. I would
> like to be able to see progress of COPY command by pid. There is a lot
> of COPY info inside CopyToStateData and CopyFromStateData structs. The
> only limitation I see is support of only int64 values for progress
> reporting columns. I'm not sure if it is easily possible to expose for
> example filename this way.
>
> >
> > 3) I wonder if we should have something like lines_estimated. For COPY
> > TO it's pretty simple to calculate it as
> >
> >      bytes_total / (bytes_processed / lines_processed)
> >
> > but what about
> >
> >      COPY (query) TO file
> >
> > In that case we don't know the total amount of bytes / rows, so we can't
> > calculate any estimate. So maybe we could peek into the query plan? But
> > I agree this is something we can add later.
>
> If I remember well one of the original ideas was to be able to pass
> custom bytes_total (or lines_total) by COPY options to be stored in
> copy progress. I can add this in some following patch if still
> welcomed.
>
> For estimates I would prefer to add an additional column to not mix
> those two together (or at least boolean estimated = true/false and
> reuse bytes_total column). If query estimates are welcomed, I can take
> a look at how to reach the query plan and expose those numbers when
> the query is used to estimated_lines and potentially estimated_bytes
> columns. It would be probably a little tricky to calculate
> estimated_bytes for some column types.
>
> Also currently only COPY FROM file supports bytes_total (it is 0 for
> all other scenarios). I think it should be possible for PostgreSQL to
> know the actual amount of lines query returns for some kind of
> queries, but I have no idea where to look at this. If that's possible
> to get, it would be one of the next steps to introduce additional
> column lines_total.
>
> >
> > 4) This comment is a bit confusing, as it mixes "total" and "processed".
> > I'd just say "number of bytes processed so far" instead.
> >
> Fixed in attached patch.
> >
> > Other than that, it seems fine. I'm sure we could add more features, but
> > it seems like a good start - I plan to get this committed once I get a
> > patch fixing the docs issues.
>
> Patch is attached, it should be applied to the top of the previous
> patch. Overall patch (having both patches merged together) could be
> found at https://patch-diff.githubusercontent.com/raw/simi/postgres/pull/6.patch.
>
> > regards
> >
> > --
> > Tomas Vondra
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company

Attachment

001-copy-progress-revivew-1.patch

Re: [PATCH] Simple progress reporting for COPY command

From

Matthias van de Meent

Date:

06 January 2021, 01:04:56

On Fri, 1 Jan 2021 at 02:25, Josef Šimánek <josef.simanek@gmail.com> wrote:
>
> Hello,
>
> finally I had some time to revisit patch and all comments from
> https://www.postgresql.org/message-id/CAFp7QwqMGEi4OyyaLEK9DR0%2BE%2BoK3UtA4bEjDVCa4bNkwUY2PQ%40mail.gmail.com
> and I have prepared simple version of COPY command progress reporting.

+1

> To keep the patch small as possible, I have introduced only a minimum
> set of columns. It could be extended later if needed.

Seems reasonable. One potential extention I could think of is progress reporting of tuples processed before/after applying the COPY's WHERE clause, when and where applicable.

> Columns are inspired by CREATE INDEX progress report system view.
>
> pid - integer - PID of backend
> datid - oid - OID of related database
> datname - name - name of related database (this seems redundant, since
> oid should be enough, but it is the same in CREATE INDEX)
> relid - oid - oid of table related to COPY command, when not known
> (for example when copying to file, it is 0)
> bytes_processed - bigint - amount of bytes processed
> bytes_total - bigint - file size in bytes if COPY FROM file (0 if not
> COPY FROM file)
> lines_processed - bigint - amount of tuples processed

Could you update the name of this column to be consistent with the 'tuples'-terminology used in the other progress reporting views, i.e. lines_processed -> tuples_processed? That adds consistency, and also disambiguates the meaning in case of e.g. COPY FROM (format CSV), as multiline CSV fields do exist, and we're not necessarily counting lines from or to a file.

With regards,

Matthias van de Meent

Re: [PATCH] Simple progress reporting for COPY command

From

Tomas Vondra

Date:

06 January 2021, 21:44:49

On 1/5/21 11:02 AM, Josef Šimánek wrote:
> I'm attaching the whole patch since commitfest failed to ingest the
> last incremental on CI.
> 

Yeah, the whole patch needs to be attached for the commitfest tester to 
work correctly - it can't apply pieces from multiple messages, etc.

Anyway, I pushed this last version of patch, after a couple more tweaks, 
mainly to the docs - one place used pg_stat_copy_progress, the section 
was not indexed properly, and so on.

I see Matthias proposed to change "lines" to "tuples" - I only saw the 
message after pushing, but I probably wouldn't make that change anyway. 
The CSV docs seem to talk about lines, newlines etc. so it seems fine. 
If not, we can change that.

One more question, though - I now realize the lines_processed ignores 
rows skipped because of BEFORE INSERT triggers. I wonder if that's the 
right thing to do? Imagine you know the number of lines in a file. You 
can't really use (lines_processed / total_lines) to measure progress, 
because that may ignore many "skipped" rows. So maybe this should be 
changed to count all rows. OTOH we still have bytes_processed.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [PATCH] Simple progress reporting for COPY command

From

Josef Šimánek

Date:

07 January 2021, 11:06:11

st 6. 1. 2021 v 22:44 odesílatel Tomas Vondra
<tomas.vondra@enterprisedb.com> napsal:
>
> On 1/5/21 11:02 AM, Josef Šimánek wrote:
> > I'm attaching the whole patch since commitfest failed to ingest the
> > last incremental on CI.
> >
>
> Yeah, the whole patch needs to be attached for the commitfest tester to
> work correctly - it can't apply pieces from multiple messages, etc.
>
> Anyway, I pushed this last version of patch, after a couple more tweaks,
> mainly to the docs - one place used pg_stat_copy_progress, the section
> was not indexed properly, and so on.
>
> I see Matthias proposed to change "lines" to "tuples" - I only saw the
> message after pushing, but I probably wouldn't make that change anyway.
> The CSV docs seem to talk about lines, newlines etc. so it seems fine.
> If not, we can change that.
>
> One more question, though - I now realize the lines_processed ignores
> rows skipped because of BEFORE INSERT triggers. I wonder if that's the
> right thing to do? Imagine you know the number of lines in a file. You
> can't really use (lines_processed / total_lines) to measure progress,
> because that may ignore many "skipped" rows. So maybe this should be
> changed to count all rows. OTOH we still have bytes_processed.

I think that should be fixed. It is called "lines_processed" not
"lines_inserted". I'll take a look.

>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Re: [PATCH] Simple progress reporting for COPY command

From

Amit Kapila

Date:

07 January 2021, 13:10:48

On Thu, Jan 7, 2021 at 3:15 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 1/5/21 11:02 AM, Josef Šimánek wrote:
> > I'm attaching the whole patch since commitfest failed to ingest the
> > last incremental on CI.
> >
>
> Yeah, the whole patch needs to be attached for the commitfest tester to
> work correctly - it can't apply pieces from multiple messages, etc.
>
> Anyway, I pushed this last version of patch, after a couple more tweaks,
> mainly to the docs - one place used pg_stat_copy_progress, the section
> was not indexed properly, and so on.
>

How about including a column for command similar to
pg_stat_progress_create_index and pg_stat_progress_cluster? It seems
that command will be useful in the context of COPY as there are many
variants of COPY.

--
With Regards,
Amit Kapila.

Re: [PATCH] Simple progress reporting for COPY command

From

Josef Šimánek

Date:

07 January 2021, 13:32:12

čt 7. 1. 2021 v 14:08 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal:
>
> On Thu, Jan 7, 2021 at 3:15 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 1/5/21 11:02 AM, Josef Šimánek wrote:
> > > I'm attaching the whole patch since commitfest failed to ingest the
> > > last incremental on CI.
> > >
> >
> > Yeah, the whole patch needs to be attached for the commitfest tester to
> > work correctly - it can't apply pieces from multiple messages, etc.
> >
> > Anyway, I pushed this last version of patch, after a couple more tweaks,
> > mainly to the docs - one place used pg_stat_copy_progress, the section
> > was not indexed properly, and so on.
> >
>
> How about including a column for command similar to
> pg_stat_progress_create_index and pg_stat_progress_cluster? It seems
> that command will be useful in the context of COPY as there are many
> variants of COPY.
>

From pg_stat_progress_create_index docs:

The command that is running: CREATE INDEX, CREATE INDEX CONCURRENTLY,
REINDEX, or REINDEX CONCURRENTLY.

Which values should be present in copy progress? Just COPY FROM or COPY TO?

> With Regards,
> Amit Kapila.

Re: [PATCH] Simple progress reporting for COPY command

From

Tomas Vondra

Date:

07 January 2021, 15:54:51


On 1/7/21 12:06 PM, Josef Šimánek wrote:
> st 6. 1. 2021 v 22:44 odesílatel Tomas Vondra
> <tomas.vondra@enterprisedb.com> napsal:
>>
>> On 1/5/21 11:02 AM, Josef Šimánek wrote:
>>> I'm attaching the whole patch since commitfest failed to ingest the
>>> last incremental on CI.
>>>
>>
>> Yeah, the whole patch needs to be attached for the commitfest tester to
>> work correctly - it can't apply pieces from multiple messages, etc.
>>
>> Anyway, I pushed this last version of patch, after a couple more tweaks,
>> mainly to the docs - one place used pg_stat_copy_progress, the section
>> was not indexed properly, and so on.
>>
>> I see Matthias proposed to change "lines" to "tuples" - I only saw the
>> message after pushing, but I probably wouldn't make that change anyway.
>> The CSV docs seem to talk about lines, newlines etc. so it seems fine.
>> If not, we can change that.
>>
>> One more question, though - I now realize the lines_processed ignores
>> rows skipped because of BEFORE INSERT triggers. I wonder if that's the
>> right thing to do? Imagine you know the number of lines in a file. You
>> can't really use (lines_processed / total_lines) to measure progress,
>> because that may ignore many "skipped" rows. So maybe this should be
>> changed to count all rows. OTOH we still have bytes_processed.
> 
> I think that should be fixed. It is called "lines_processed" not
> "lines_inserted". I'll take a look.
> 

So we may either rename the column to "lines_inserted", or tweak the 
code to count all processed lines. Or track both and have two columns.

regarss

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [PATCH] Simple progress reporting for COPY command

From

Josef Šimánek

Date:

07 January 2021, 16:00:18

čt 7. 1. 2021 v 16:54 odesílatel Tomas Vondra
<tomas.vondra@enterprisedb.com> napsal:
>
>
>
> On 1/7/21 12:06 PM, Josef Šimánek wrote:
> > st 6. 1. 2021 v 22:44 odesílatel Tomas Vondra
> > <tomas.vondra@enterprisedb.com> napsal:
> >>
> >> On 1/5/21 11:02 AM, Josef Šimánek wrote:
> >>> I'm attaching the whole patch since commitfest failed to ingest the
> >>> last incremental on CI.
> >>>
> >>
> >> Yeah, the whole patch needs to be attached for the commitfest tester to
> >> work correctly - it can't apply pieces from multiple messages, etc.
> >>
> >> Anyway, I pushed this last version of patch, after a couple more tweaks,
> >> mainly to the docs - one place used pg_stat_copy_progress, the section
> >> was not indexed properly, and so on.
> >>
> >> I see Matthias proposed to change "lines" to "tuples" - I only saw the
> >> message after pushing, but I probably wouldn't make that change anyway.
> >> The CSV docs seem to talk about lines, newlines etc. so it seems fine.
> >> If not, we can change that.
> >>
> >> One more question, though - I now realize the lines_processed ignores
> >> rows skipped because of BEFORE INSERT triggers. I wonder if that's the
> >> right thing to do? Imagine you know the number of lines in a file. You
> >> can't really use (lines_processed / total_lines) to measure progress,
> >> because that may ignore many "skipped" rows. So maybe this should be
> >> changed to count all rows. OTOH we still have bytes_processed.
> >
> > I think that should be fixed. It is called "lines_processed" not
> > "lines_inserted". I'll take a look.
> >
>
> So we may either rename the column to "lines_inserted", or tweak the
> code to count all processed lines. Or track both and have two columns.

First I'll ensure lines_processed represents the actual amount of
processed lines. If reading from file and some lines are skipped due
to before insert trigger, I consider that one processed as well, even
if it is not inserted. If welcomed, I can add lines_inserted later as
well. But I'm not sure about the use case.

Also thanks to mentioning triggers, I think those could be used to
test the COPY progress (at least some variants). I'll check if I would
be able to add some test cases as well.

> regarss
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Re: [PATCH] Simple progress reporting for COPY command

From

Tomas Vondra

Date:

07 January 2021, 16:37:14

On 1/7/21 2:17 AM, Justin Pryzby wrote:
> On Wed, Jan 06, 2021 at 10:44:49PM +0100, Tomas Vondra wrote:
>> On 1/5/21 11:02 AM, Josef Šimánek wrote:
>>> I'm attaching the whole patch since commitfest failed to ingest the
>>> last incremental on CI.
>>>
>>
>> Yeah, the whole patch needs to be attached for the commitfest tester to work
>> correctly - it can't apply pieces from multiple messages, etc.
>>
>> Anyway, I pushed this last version of patch, after a couple more tweaks,
>> mainly to the docs - one place used pg_stat_copy_progress, the section was
>> not indexed properly, and so on.
> 
> Some more doc fixes.
> 
Thanks, pushed.

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [PATCH] Simple progress reporting for COPY command

From

Matthias van de Meent

Date:

07 January 2021, 18:51:12

On Wed, 6 Jan 2021 at 22:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 1/5/21 11:02 AM, Josef Šimánek wrote:
> > I'm attaching the whole patch since commitfest failed to ingest the
> > last incremental on CI.
> >
>
> Yeah, the whole patch needs to be attached for the commitfest tester to
> work correctly - it can't apply pieces from multiple messages, etc.
>
> Anyway, I pushed this last version of patch, after a couple more tweaks,
> mainly to the docs - one place used pg_stat_copy_progress, the section
> was not indexed properly, and so on.

Thank you all, I'd love to use this in the future to keep track of
(e.g.) my backup/restore progress.

For my previously mentioned extension to keep track of filtered tuples
that are excluded by the WHERE-clause, PFA patch 0001 that keeps track
of that, in line with the current column name style of lines.

If so desired, I'll split this off into a different thread & CF entry.

> I see Matthias proposed to change "lines" to "tuples" - I only saw the
> message after pushing, but I probably wouldn't make that change anyway.
> The CSV docs seem to talk about lines, newlines etc. so it seems fine.
> If not, we can change that.

The CSV docs, sure. But copy doesn't only process CSVs; it also has
text (which does have a # lines = # tuples / rows guarantee) and
binary (in which the 'line' vocabulary doesn't make sense, and in
which the 'tuples' vocabulary is used). Additionally, most mentions of
postgres' logical rows/tuples in the COPY documentation use the 'rows'
terminology ('tuples' for FORMAT BINARY), and use 'line' only for
external format's textual representation's strings delimited by
newlines (which I believe is not exactly what we're counting).

One common user of COPY is the pg_dump tool, and that uses binary
format by default (iirc).

Additionally, all comments surrounding the *LINES_PROCESSED updates
only mention 'tuples', so I'd like to strongly suggest (a variant of)
attached patch 0002 to keep the vocabulary consistent by using
'tuples' instead of 'lines'.

With regards,

Matthias van de Meent

Attachment

Re: [PATCH] Simple progress reporting for COPY command

From

Josef Šimánek

Date:

07 January 2021, 18:56:12

čt 7. 1. 2021 v 19:51 odesílatel Matthias van de Meent
<boekewurm+postgres@gmail.com> napsal:
>
> On Wed, 6 Jan 2021 at 22:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 1/5/21 11:02 AM, Josef Šimánek wrote:
> > > I'm attaching the whole patch since commitfest failed to ingest the
> > > last incremental on CI.
> > >
> >
> > Yeah, the whole patch needs to be attached for the commitfest tester to
> > work correctly - it can't apply pieces from multiple messages, etc.
> >
> > Anyway, I pushed this last version of patch, after a couple more tweaks,
> > mainly to the docs - one place used pg_stat_copy_progress, the section
> > was not indexed properly, and so on.
>
> Thank you all, I'd love to use this in the future to keep track of
> (e.g.) my backup/restore progress.
>
> For my previously mentioned extension to keep track of filtered tuples
> that are excluded by the WHERE-clause, PFA patch 0001 that keeps track
> of that, in line with the current column name style of lines.

If I understand it well, this column could be used on COPY TO to track
skipped lines because of BEFORE TRIGGER, right? I can include this in
my following patch keeping lines_processed incremented even for
skipped lines as well.

> If so desired, I'll split this off into a different thread & CF entry.
>
> > I see Matthias proposed to change "lines" to "tuples" - I only saw the
> > message after pushing, but I probably wouldn't make that change anyway.
> > The CSV docs seem to talk about lines, newlines etc. so it seems fine.
> > If not, we can change that.
>
> The CSV docs, sure. But copy doesn't only process CSVs; it also has
> text (which does have a # lines = # tuples / rows guarantee) and
> binary (in which the 'line' vocabulary doesn't make sense, and in
> which the 'tuples' vocabulary is used). Additionally, most mentions of
> postgres' logical rows/tuples in the COPY documentation use the 'rows'
> terminology ('tuples' for FORMAT BINARY), and use 'line' only for
> external format's textual representation's strings delimited by
> newlines (which I believe is not exactly what we're counting).
>
> One common user of COPY is the pg_dump tool, and that uses binary
> format by default (iirc).
>
> Additionally, all comments surrounding the *LINES_PROCESSED updates
> only mention 'tuples', so I'd like to strongly suggest (a variant of)
> attached patch 0002 to keep the vocabulary consistent by using
> 'tuples' instead of 'lines'.
>
>
> With regards,
>
> Matthias van de Meent

Re: [PATCH] Simple progress reporting for COPY command

From

Tomas Vondra

Date:

07 January 2021, 21:37:21


On 1/7/21 7:56 PM, Josef Šimánek wrote:
> čt 7. 1. 2021 v 19:51 odesílatel Matthias van de Meent
> <boekewurm+postgres@gmail.com> napsal:
>>
>> On Wed, 6 Jan 2021 at 22:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> On 1/5/21 11:02 AM, Josef Šimánek wrote:
>>>> I'm attaching the whole patch since commitfest failed to ingest the
>>>> last incremental on CI.
>>>>
>>>
>>> Yeah, the whole patch needs to be attached for the commitfest tester to
>>> work correctly - it can't apply pieces from multiple messages, etc.
>>>
>>> Anyway, I pushed this last version of patch, after a couple more tweaks,
>>> mainly to the docs - one place used pg_stat_copy_progress, the section
>>> was not indexed properly, and so on.
>>
>> Thank you all, I'd love to use this in the future to keep track of
>> (e.g.) my backup/restore progress.
>>
>> For my previously mentioned extension to keep track of filtered tuples
>> that are excluded by the WHERE-clause, PFA patch 0001 that keeps track
>> of that, in line with the current column name style of lines.
> 
> If I understand it well, this column could be used on COPY TO to track
> skipped lines because of BEFORE TRIGGER, right? I can include this in
> my following patch keeping lines_processed incremented even for
> skipped lines as well.
> 

That's my understanding too.

>> If so desired, I'll split this off into a different thread & CF entry.
>>
>>> I see Matthias proposed to change "lines" to "tuples" - I only saw the
>>> message after pushing, but I probably wouldn't make that change anyway.
>>> The CSV docs seem to talk about lines, newlines etc. so it seems fine.
>>> If not, we can change that.
>>
>> The CSV docs, sure. But copy doesn't only process CSVs; it also has
>> text (which does have a # lines = # tuples / rows guarantee) and
>> binary (in which the 'line' vocabulary doesn't make sense, and in
>> which the 'tuples' vocabulary is used). Additionally, most mentions of
>> postgres' logical rows/tuples in the COPY documentation use the 'rows'
>> terminology ('tuples' for FORMAT BINARY), and use 'line' only for
>> external format's textual representation's strings delimited by
>> newlines (which I believe is not exactly what we're counting).
>>
>> One common user of COPY is the pg_dump tool, and that uses binary
>> format by default (iirc).
>>
>> Additionally, all comments surrounding the *LINES_PROCESSED updates
>> only mention 'tuples', so I'd like to strongly suggest (a variant of)
>> attached patch 0002 to keep the vocabulary consistent by using
>> 'tuples' instead of 'lines'.
>>

I'm not particularly attached to the "lines" naming, it just seemed OK 
to me. So if there's consensus to rename this somehow, I'm OK with it.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [PATCH] Simple progress reporting for COPY command

From

Josef Šimánek

Date:

07 January 2021, 22:00:24

čt 7. 1. 2021 v 22:37 odesílatel Tomas Vondra
<tomas.vondra@enterprisedb.com> napsal:
>
>
>
> On 1/7/21 7:56 PM, Josef Šimánek wrote:
> > čt 7. 1. 2021 v 19:51 odesílatel Matthias van de Meent
> > <boekewurm+postgres@gmail.com> napsal:
> >>
> >> On Wed, 6 Jan 2021 at 22:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> >>>
> >>> On 1/5/21 11:02 AM, Josef Šimánek wrote:
> >>>> I'm attaching the whole patch since commitfest failed to ingest the
> >>>> last incremental on CI.
> >>>>
> >>>
> >>> Yeah, the whole patch needs to be attached for the commitfest tester to
> >>> work correctly - it can't apply pieces from multiple messages, etc.
> >>>
> >>> Anyway, I pushed this last version of patch, after a couple more tweaks,
> >>> mainly to the docs - one place used pg_stat_copy_progress, the section
> >>> was not indexed properly, and so on.
> >>
> >> Thank you all, I'd love to use this in the future to keep track of
> >> (e.g.) my backup/restore progress.
> >>
> >> For my previously mentioned extension to keep track of filtered tuples
> >> that are excluded by the WHERE-clause, PFA patch 0001 that keeps track
> >> of that, in line with the current column name style of lines.
> >
> > If I understand it well, this column could be used on COPY TO to track
> > skipped lines because of BEFORE TRIGGER, right? I can include this in
> > my following patch keeping lines_processed incremented even for
> > skipped lines as well.
> >
>
> That's my understanding too.
>
> >> If so desired, I'll split this off into a different thread & CF entry.
> >>
> >>> I see Matthias proposed to change "lines" to "tuples" - I only saw the
> >>> message after pushing, but I probably wouldn't make that change anyway.
> >>> The CSV docs seem to talk about lines, newlines etc. so it seems fine.
> >>> If not, we can change that.
> >>
> >> The CSV docs, sure. But copy doesn't only process CSVs; it also has
> >> text (which does have a # lines = # tuples / rows guarantee) and
> >> binary (in which the 'line' vocabulary doesn't make sense, and in
> >> which the 'tuples' vocabulary is used). Additionally, most mentions of
> >> postgres' logical rows/tuples in the COPY documentation use the 'rows'
> >> terminology ('tuples' for FORMAT BINARY), and use 'line' only for
> >> external format's textual representation's strings delimited by
> >> newlines (which I believe is not exactly what we're counting).
> >>
> >> One common user of COPY is the pg_dump tool, and that uses binary
> >> format by default (iirc).
> >>
> >> Additionally, all comments surrounding the *LINES_PROCESSED updates
> >> only mention 'tuples', so I'd like to strongly suggest (a variant of)
> >> attached patch 0002 to keep the vocabulary consistent by using
> >> 'tuples' instead of 'lines'.
> >>
>
> I'm not particularly attached to the "lines" naming, it just seemed OK
> to me. So if there's consensus to rename this somehow, I'm OK with it.

The problem I do see here is it depends on the "way" of COPY. If
you're copying from CSV file to table, those are actually lines (since
1 line = 1 tuple). But copying from DB to file is copying tuples (but
1 tuple = 1 file line). Line works better here for me personally.

Once I'll fix the problem with triggers (and also another cases if
found), I think we can consider it lines. It will represent amount of
lines processed from file on COPY FROM and amount of lines written to
file in COPY TO form (at least in CSV format). I'm not sure how BINARY
format works, I'll check.
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Re: [PATCH] Simple progress reporting for COPY command

From

Amit Kapila

Date:

08 January 2021, 02:57:24

On Thu, Jan 7, 2021 at 7:02 PM Josef Šimánek <josef.simanek@gmail.com> wrote:
>
> čt 7. 1. 2021 v 14:08 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal:
> >
> > On Thu, Jan 7, 2021 at 3:15 AM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> > >
> > > On 1/5/21 11:02 AM, Josef Šimánek wrote:
> > > > I'm attaching the whole patch since commitfest failed to ingest the
> > > > last incremental on CI.
> > > >
> > >
> > > Yeah, the whole patch needs to be attached for the commitfest tester to
> > > work correctly - it can't apply pieces from multiple messages, etc.
> > >
> > > Anyway, I pushed this last version of patch, after a couple more tweaks,
> > > mainly to the docs - one place used pg_stat_copy_progress, the section
> > > was not indexed properly, and so on.
> > >
> >
> > How about including a column for command similar to
> > pg_stat_progress_create_index and pg_stat_progress_cluster? It seems
> > that command will be useful in the context of COPY as there are many
> > variants of COPY.
> >
>
> From pg_stat_progress_create_index docs:
>
> The command that is running: CREATE INDEX, CREATE INDEX CONCURRENTLY,
> REINDEX, or REINDEX CONCURRENTLY.
>
> Which values should be present in copy progress? Just COPY FROM or COPY TO?
>

Can't we display the entire COPY command? I checked that
pg_stat_statements display the query so there shouldn't be a problem
to display the entire command.

--
With Regards,
Amit Kapila.

Re: [PATCH] Simple progress reporting for COPY command

From

Josef Šimánek

Date:

08 January 2021, 03:12:11

pá 8. 1. 2021 v 3:55 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal:
>
> On Thu, Jan 7, 2021 at 7:02 PM Josef Šimánek <josef.simanek@gmail.com> wrote:
> >
> > čt 7. 1. 2021 v 14:08 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal:
> > >
> > > On Thu, Jan 7, 2021 at 3:15 AM Tomas Vondra
> > > <tomas.vondra@enterprisedb.com> wrote:
> > > >
> > > > On 1/5/21 11:02 AM, Josef Šimánek wrote:
> > > > > I'm attaching the whole patch since commitfest failed to ingest the
> > > > > last incremental on CI.
> > > > >
> > > >
> > > > Yeah, the whole patch needs to be attached for the commitfest tester to
> > > > work correctly - it can't apply pieces from multiple messages, etc.
> > > >
> > > > Anyway, I pushed this last version of patch, after a couple more tweaks,
> > > > mainly to the docs - one place used pg_stat_copy_progress, the section
> > > > was not indexed properly, and so on.
> > > >
> > >
> > > How about including a column for command similar to
> > > pg_stat_progress_create_index and pg_stat_progress_cluster? It seems
> > > that command will be useful in the context of COPY as there are many
> > > variants of COPY.
> > >
> >
> > From pg_stat_progress_create_index docs:
> >
> > The command that is running: CREATE INDEX, CREATE INDEX CONCURRENTLY,
> > REINDEX, or REINDEX CONCURRENTLY.
> >
> > Which values should be present in copy progress? Just COPY FROM or COPY TO?
> >
>
> Can't we display the entire COPY command? I checked that
> pg_stat_statements display the query so there shouldn't be a problem
> to display the entire command.

In previous discussions there was mentioned it doesn't make sense
since you can join with pg_stat_statements on the pid column if
needed. What would be the reason to duplicate that info?

> --
> With Regards,
> Amit Kapila.

Re: [PATCH] Simple progress reporting for COPY command

From

Amit Kapila

Date:

08 January 2021, 04:05:46

On Fri, Jan 8, 2021 at 8:42 AM Josef Šimánek <josef.simanek@gmail.com> wrote:
>
> pá 8. 1. 2021 v 3:55 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal:
> >
> >
> > Can't we display the entire COPY command? I checked that
> > pg_stat_statements display the query so there shouldn't be a problem
> > to display the entire command.
>
> In previous discussions there was mentioned it doesn't make sense
> since you can join with pg_stat_statements on the pid column if
> needed. What would be the reason to duplicate that info?
>

But pg_stat_staments won't be present by default. Also, the same
argument could be applied for the command to be present in
stat_progress views. It occurred to me only when I was trying to
compare what we display in all the progress views. I think there is
some merit in being consistent.

--
With Regards,
Amit Kapila.

Re: [PATCH] Simple progress reporting for COPY command

From

Josef Šimánek

Date:

08 January 2021, 04:15:16

pá 8. 1. 2021 v 5:03 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal:
>
> On Fri, Jan 8, 2021 at 8:42 AM Josef Šimánek <josef.simanek@gmail.com> wrote:
> >
> > pá 8. 1. 2021 v 3:55 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal:
> > >
> > >
> > > Can't we display the entire COPY command? I checked that
> > > pg_stat_statements display the query so there shouldn't be a problem
> > > to display the entire command.
> >
> > In previous discussions there was mentioned it doesn't make sense
> > since you can join with pg_stat_statements on the pid column if
> > needed. What would be the reason to duplicate that info?
> >
>
> But pg_stat_staments won't be present by default. Also, the same
> argument could be applied for the command to be present in
> stat_progress views. It occurred to me only when I was trying to
> compare what we display in all the progress views. I think there is
> some merit in being consistent.

Sorry, I mean pg_stat_activity (not pg_stat_statements). That is
included by default (at least in my installations).

> --
> With Regards,
> Amit Kapila.

Re: [PATCH] Simple progress reporting for COPY command

From

Amit Kapila

Date:

08 January 2021, 04:29:36

On Fri, Jan 8, 2021 at 9:45 AM Josef Šimánek <josef.simanek@gmail.com> wrote:
>
> pá 8. 1. 2021 v 5:03 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal:
> >
> > On Fri, Jan 8, 2021 at 8:42 AM Josef Šimánek <josef.simanek@gmail.com> wrote:
> > >
> > > pá 8. 1. 2021 v 3:55 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal:
> > > >
> > > >
> > > > Can't we display the entire COPY command? I checked that
> > > > pg_stat_statements display the query so there shouldn't be a problem
> > > > to display the entire command.
> > >
> > > In previous discussions there was mentioned it doesn't make sense
> > > since you can join with pg_stat_statements on the pid column if
> > > needed. What would be the reason to duplicate that info?
> > >
> >
> > But pg_stat_staments won't be present by default. Also, the same
> > argument could be applied for the command to be present in
> > stat_progress views. It occurred to me only when I was trying to
> > compare what we display in all the progress views. I think there is
> > some merit in being consistent.
>
> Sorry, I mean pg_stat_activity (not pg_stat_statements). That is
> included by default (at least in my installations).
>

Okay, that makes sense but I still wonder why we display it in other
stat_progress views?

--
With Regards,
Amit Kapila.

Re: [PATCH] Simple progress reporting for COPY command

From

Matthias van de Meent

Date:

08 January 2021, 13:30:22

On Thu, 7 Jan 2021 at 23:00, Josef Šimánek <josef.simanek@gmail.com> wrote:
>
> čt 7. 1. 2021 v 22:37 odesílatel Tomas Vondra
> <tomas.vondra@enterprisedb.com> napsal:
> >
> > I'm not particularly attached to the "lines" naming, it just seemed OK
> > to me. So if there's consensus to rename this somehow, I'm OK with it.
>
> The problem I do see here is it depends on the "way" of COPY. If
> you're copying from CSV file to table, those are actually lines (since
> 1 line = 1 tuple). But copying from DB to file is copying tuples (but
> 1 tuple = 1 file line). Line works better here for me personally.
>
> Once I'll fix the problem with triggers (and also another cases if
> found), I think we can consider it lines. It will represent amount of
> lines processed from file on COPY FROM and amount of lines written to
> file in COPY TO form (at least in CSV format). I'm not sure how BINARY
> format works, I'll check.

Counterexample that 1 tuple need not be 1 line, in csv/binary:

/*
 * create a table with one tuple containing 1 text field, which consists of
 * 10 newline characters.
 * If you want windows-style lines, replace '\x0A' (\n) with '\x0D0A' (\r\n).
 */
# CREATE TABLE ttab (val) AS
  SELECT * FROM (values (
    repeat(convert_from(E'\x0A'::bytea, 'UTF8'), 10)::text
  )) as v;

# -- indeed, one unix-style line, according to $ wc -l copy.txt
# COPY ttab TO 'copy.txt' (format text);
COPY 1
# TRUNCATE ttab; COPY ttab FROM 'copy.txt' (format text);
COPY 1

# -- 11 lines
# COPY ttab TO 'copy.csv' (format csv);
COPY 1
# TRUNCATE ttab; COPY ttab FROM 'copy.csv' (format csv);
COPY 1

# -- 13 lines
# COPY ttab TO 'copy.bin' (format binary);
COPY 1
# TRUNCATE ttab; COPY ttab FROM 'copy.bin' (format binary);
COPY 1

All of the above copy statements would only report 'lines_processed = 1',
in the progress reporting, while csv/binary line counts are definatively
inconsistent with what the progress reporting shows, because progress
reporting counts tuples / table rows, not the amount of lines in the
external file.

Re: [PATCH] Simple progress reporting for COPY command

From

Bharath Rupireddy

Date:

09 January 2021, 02:04:42

On Fri, Jan 8, 2021 at 7:00 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> On Thu, 7 Jan 2021 at 23:00, Josef Šimánek <josef.simanek@gmail.com> wrote:
> >
> > čt 7. 1. 2021 v 22:37 odesílatel Tomas Vondra
> > <tomas.vondra@enterprisedb.com> napsal:
> > >
> > > I'm not particularly attached to the "lines" naming, it just seemed OK
> > > to me. So if there's consensus to rename this somehow, I'm OK with it.
> >
> > The problem I do see here is it depends on the "way" of COPY. If
> > you're copying from CSV file to table, those are actually lines (since
> > 1 line = 1 tuple). But copying from DB to file is copying tuples (but
> > 1 tuple = 1 file line). Line works better here for me personally.
> >
> > Once I'll fix the problem with triggers (and also another cases if
> > found), I think we can consider it lines. It will represent amount of
> > lines processed from file on COPY FROM and amount of lines written to
> > file in COPY TO form (at least in CSV format). I'm not sure how BINARY
> > format works, I'll check.
>
> Counterexample that 1 tuple need not be 1 line, in csv/binary:

Yes in binary format file, 1 tuple is definitely not 1 line.

> /*
>  * create a table with one tuple containing 1 text field, which consists of
>  * 10 newline characters.
>  * If you want windows-style lines, replace '\x0A' (\n) with '\x0D0A' (\r\n).
>  */
> # CREATE TABLE ttab (val) AS
>   SELECT * FROM (values (
>     repeat(convert_from(E'\x0A'::bytea, 'UTF8'), 10)::text
>   )) as v;
>
> # -- indeed, one unix-style line, according to $ wc -l copy.txt
> # COPY ttab TO 'copy.txt' (format text);
> COPY 1
> # TRUNCATE ttab; COPY ttab FROM 'copy.txt' (format text);
> COPY 1
>
> # -- 11 lines
> # COPY ttab TO 'copy.csv' (format csv);
> COPY 1
> # TRUNCATE ttab; COPY ttab FROM 'copy.csv' (format csv);
> COPY 1
>
> # -- 13 lines
> # COPY ttab TO 'copy.bin' (format binary);
> COPY 1
> # TRUNCATE ttab; COPY ttab FROM 'copy.bin' (format binary);
> COPY 1
>
> All of the above copy statements would only report 'lines_processed = 1',
> in the progress reporting, while csv/binary line counts are definatively
> inconsistent with what the progress reporting shows, because progress
> reporting counts tuples / table rows, not the amount of lines in the
> external file.

I agree with that. +1 to change the name of 'lines_processed' column
to 'tuples_processed'. So, the meaning of 'tuples_processed' can be -
for COPY FROM, it's the number of tuples that are written to the
target table, for COPY TO, it's the number of tuples from the source
table or source plan that are written out to the file/stdin.

IMHO, it's good to have at least the following info: the type of
command COPY TO or COPY FROM, the format CSV,BINARY,TEXT and from
where the input comes from or output goes to - FILE, STDIN, STDOUT or
PROGRAM. Otherwise users will have to join pg_stat_progress_copy view
with other views to get the type of copy command or the entire query.
That's sometimes cumbersome to figure out what's the other view name
and write join queries.

Another point related to documentation, I see that the
pg_stat_progress_copy view details are added to monitoring.sgml, but I
didn't find any mention of it in the copy.sgml. For instance, we have
pg_stat_progress_basebackup documentation both in monitoring.sgml and
a mention of it and link to it in pg_basebackup.sgml. Usually, users
will look at copy documentation to find any kind of information
related to it, so better we have it mentioned there as well.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com