Re: [PATCH] Simple progress reporting for COPY command - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: [PATCH] Simple progress reporting for COPY command
Date
Msg-id CALj2ACXstXE5TXazXkDmJAK-40XZsEYTRG0m4pUVf8nBGfHrMA@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Simple progress reporting for COPY command  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
List pgsql-hackers
On Fri, Jan 8, 2021 at 7:00 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> On Thu, 7 Jan 2021 at 23:00, Josef Šimánek <josef.simanek@gmail.com> wrote:
> >
> > čt 7. 1. 2021 v 22:37 odesílatel Tomas Vondra
> > <tomas.vondra@enterprisedb.com> napsal:
> > >
> > > I'm not particularly attached to the "lines" naming, it just seemed OK
> > > to me. So if there's consensus to rename this somehow, I'm OK with it.
> >
> > The problem I do see here is it depends on the "way" of COPY. If
> > you're copying from CSV file to table, those are actually lines (since
> > 1 line = 1 tuple). But copying from DB to file is copying tuples (but
> > 1 tuple = 1 file line). Line works better here for me personally.
> >
> > Once I'll fix the problem with triggers (and also another cases if
> > found), I think we can consider it lines. It will represent amount of
> > lines processed from file on COPY FROM and amount of lines written to
> > file in COPY TO form (at least in CSV format). I'm not sure how BINARY
> > format works, I'll check.
>
> Counterexample that 1 tuple need not be 1 line, in csv/binary:

Yes in binary format file, 1 tuple is definitely not 1 line.

> /*
>  * create a table with one tuple containing 1 text field, which consists of
>  * 10 newline characters.
>  * If you want windows-style lines, replace '\x0A' (\n) with '\x0D0A' (\r\n).
>  */
> # CREATE TABLE ttab (val) AS
>   SELECT * FROM (values (
>     repeat(convert_from(E'\x0A'::bytea, 'UTF8'), 10)::text
>   )) as v;
>
> # -- indeed, one unix-style line, according to $ wc -l copy.txt
> # COPY ttab TO 'copy.txt' (format text);
> COPY 1
> # TRUNCATE ttab; COPY ttab FROM 'copy.txt' (format text);
> COPY 1
>
> # -- 11 lines
> # COPY ttab TO 'copy.csv' (format csv);
> COPY 1
> # TRUNCATE ttab; COPY ttab FROM 'copy.csv' (format csv);
> COPY 1
>
> # -- 13 lines
> # COPY ttab TO 'copy.bin' (format binary);
> COPY 1
> # TRUNCATE ttab; COPY ttab FROM 'copy.bin' (format binary);
> COPY 1
>
> All of the above copy statements would only report 'lines_processed = 1',
> in the progress reporting, while csv/binary line counts are definatively
> inconsistent with what the progress reporting shows, because progress
> reporting counts tuples / table rows, not the amount of lines in the
> external file.

I agree with that. +1 to change the name of 'lines_processed' column
to 'tuples_processed'. So, the meaning of 'tuples_processed' can be -
for COPY FROM, it's the number of tuples that are written to the
target table, for COPY TO, it's the number of tuples from the source
table or source plan that are written out to the file/stdin.

IMHO, it's good to have at least the following info: the type of
command COPY TO or COPY FROM, the format CSV,BINARY,TEXT and from
where the input comes from or output goes to - FILE, STDIN, STDOUT or
PROGRAM. Otherwise users will have to join pg_stat_progress_copy view
with other views to get the type of copy command or the entire query.
That's sometimes cumbersome to figure out what's the other view name
and write join queries.

Another point related to documentation, I see that the
pg_stat_progress_copy view details are added to monitoring.sgml, but I
didn't find any mention of it in the copy.sgml. For instance, we have
pg_stat_progress_basebackup documentation both in monitoring.sgml and
a mention of it and link to it in pg_basebackup.sgml. Usually, users
will look at copy documentation to find any kind of information
related to it, so better we have it mentioned there as well.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: japin
Date:
Subject: Re: EXPLAIN/EXPLAIN ANALYZE REFRESH MATERIALIZED VIEW
Next
From: Bharath Rupireddy
Date:
Subject: Re: logical replication worker accesses catalogs in error context callback