Thread: [PATCH] Simple progress reporting for COPY command
Hello, finally I had some time to revisit patch and all comments from https://www.postgresql.org/message-id/CAFp7QwqMGEi4OyyaLEK9DR0%2BE%2BoK3UtA4bEjDVCa4bNkwUY2PQ%40mail.gmail.com and I have prepared simple version of COPY command progress reporting. To keep the patch small as possible, I have introduced only a minimum set of columns. It could be extended later if needed. Columns are inspired by CREATE INDEX progress report system view. pid - integer - PID of backend datid - oid - OID of related database datname - name - name of related database (this seems redundant, since oid should be enough, but it is the same in CREATE INDEX) relid - oid - oid of table related to COPY command, when not known (for example when copying to file, it is 0) bytes_processed - bigint - amount of bytes processed bytes_total - bigint - file size in bytes if COPY FROM file (0 if not COPY FROM file) lines_processed - bigint - amount of tuples processed example output of progress for common use case (import from CSV file): first console: yr=# COPY test FROM '/home/retro/test.csv' (FORMAT CSV); second console: yr=# SELECT * FROM pg_stat_progress_copy; pid | datid | datname | relid | bytes_processed | bytes_total | lines_processed --------+-------+---------+-------+-----------------+-------------+----------------- 803148 | 16384 | yr | 16394 | 998965248 | 1777777796 | 56730126 (1 row) It is simple to get progress in percents for example by: yr=# SELECT (bytes_processed/bytes_total::decimal)*100 FROM pg_stat_progress_copy WHERE pid = 803148; ?column? ------------------------- 50.04287948706048525800 ^ ~50% of file processed already I did some dead simple benchmarking as well. The difference is not huge. Each command works with 100 millions of tuples. Times are in seconds. test with progress master (32d6287) difference ------------------------- --------------- ------------------ ------------ COPY table TO 46.102 47.499 -1.397 COPY query TO 52.168 49.822 2.346 COPY table TO PROGRAM 52.345 51.882 0.463 COPY query TO PROGRAM 54.141 52.763 1.378 COPY table FROM 88.970 85.161 3.809 COPY table FROM PROGRAM 94.393 90.346 4.047 Properly formatted table (since I'm not sure everyone here would be able to see the table formatted well) and the benchmark source is present at https://github.com/simi/postgres/pull/6. I have also included an example output in there. I'll add this to the current commitfest as well.
Attachment
On Fri, Jan 1, 2021 at 6:55 AM Josef Šimánek <josef.simanek@gmail.com> wrote: > finally I had some time to revisit patch and all comments from > https://www.postgresql.org/message-id/CAFp7QwqMGEi4OyyaLEK9DR0%2BE%2BoK3UtA4bEjDVCa4bNkwUY2PQ%40mail.gmail.com > and I have prepared simple version of COPY command progress reporting. Thanks for the patch. > To keep the patch small as possible, I have introduced only a minimum > set of columns. It could be extended later if needed. > > Columns are inspired by CREATE INDEX progress report system view. > > pid - integer - PID of backend > datid - oid - OID of related database > datname - name - name of related database (this seems redundant, since > oid should be enough, but it is the same in CREATE INDEX) > relid - oid - oid of table related to COPY command, when not known > (for example when copying to file, it is 0) > bytes_processed - bigint - amount of bytes processed > bytes_total - bigint - file size in bytes if COPY FROM file (0 if not > COPY FROM file) It means that, for COPY FROM STDIN, bytes_total will be 0. IIUC, so having this parameter as 0 is an indication of the COPY command being from STDIN? I'm not sure whether it's discussed or not previously, but I personally prefer to have it as a separate column, say a varchar or text column with values STDIN, FILE or PROGRAM may be. And also it will be better if we have the format of the data streaming in, as a separate column, with possible values may be CSV, TEXT, BINARY. Since this progress reporting is for both COPY FROM and COPY TO, wouldn't it be good to have that also as a separate column, say command type with values FROM and TO? Thoughts? I'm sure some of the points would have already been discussed. I just shared my thoughts here. I have not looked into the patch in detail, but it's missing tests. I'm sure we can not add the tests into copy.sql or copy2.sql, can we think of adding test cases to TAP or under isolation? I'm not sure whether other progress reporting commands have test cases. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
pá 1. 1. 2021 v 11:16 odesílatel Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> napsal: > > On Fri, Jan 1, 2021 at 6:55 AM Josef Šimánek <josef.simanek@gmail.com> wrote: > > finally I had some time to revisit patch and all comments from > > https://www.postgresql.org/message-id/CAFp7QwqMGEi4OyyaLEK9DR0%2BE%2BoK3UtA4bEjDVCa4bNkwUY2PQ%40mail.gmail.com > > and I have prepared simple version of COPY command progress reporting. > > Thanks for the patch. > > > To keep the patch small as possible, I have introduced only a minimum > > set of columns. It could be extended later if needed. > > > > Columns are inspired by CREATE INDEX progress report system view. > > > > pid - integer - PID of backend > > datid - oid - OID of related database > > datname - name - name of related database (this seems redundant, since > > oid should be enough, but it is the same in CREATE INDEX) > > relid - oid - oid of table related to COPY command, when not known > > (for example when copying to file, it is 0) > > bytes_processed - bigint - amount of bytes processed > > bytes_total - bigint - file size in bytes if COPY FROM file (0 if not > > COPY FROM file) > > It means that, for COPY FROM STDIN, bytes_total will be 0. IIUC, so > having this parameter as 0 is an indication of the COPY command being > from STDIN? I'm not sure whether it's discussed or not previously, but > I personally prefer to have it as a separate column, say a varchar or > text column with values STDIN, FILE or PROGRAM may be. And also it > will be better if we have the format of the data streaming in, as a > separate column, with possible values may be CSV, TEXT, BINARY. > > Since this progress reporting is for both COPY FROM and COPY TO, > wouldn't it be good to have that also as a separate column, say > command type with values FROM and TO? > > Thoughts? My first patch had more columns (direction - FROM/TO, file bool, program bool), but it was subject of discussion what's the purpose. You can check the query by looking at pg_stat_activity by pid to get more info. To keep it simple I decided to go with a minimal set of columns for now. It can be extended later. I'm ok to continue on improving this with more feedback once merged. > I'm sure some of the points would have already been discussed. I just > shared my thoughts here. > > I have not looked into the patch in detail, but it's missing tests. > I'm sure we can not add the tests into copy.sql or copy2.sql, can we > think of adding test cases to TAP or under isolation? I'm not sure > whether other progress reporting commands have test cases. I have raised the question in an old thread as well since there are no tests for other progress commands as well. I was told it is ok for now to keep it untested, since there's no easy way to do that for now. https://www.postgresql.org/message-id/20200615001828.GA52676%40paquier.xyz > With Regards, > Bharath Rupireddy. > EnterpriseDB: http://www.enterprisedb.com
Hi, I did take a quick look today, and I have a couple minor comments: 1) The catalog sgml docs seem to mention bytes_processed twice (one of that should be bytes_total), and line_processed (should be "lines_"). 2) I'm not quite sure about not including any info about the command. For example pg_stat_progress_create_index includes info about the command, although it's not very detailed. Not sure how useful would it be to show just COPY FROM / COPY TO, without more details. It's probably possible to extract this from pg_stat_activity, but that may be rather cumbersome (parsing arbitrary SQL and all that). Also, what if the COPY is called from a function etc.? 3) I wonder if we should have something like lines_estimated. For COPY TO it's pretty simple to calculate it as bytes_total / (bytes_processed / lines_processed) but what about COPY (query) TO file In that case we don't know the total amount of bytes / rows, so we can't calculate any estimate. So maybe we could peek into the query plan? But I agree this is something we can add later. 4) This comment is a bit confusing, as it mixes "total" and "processed". I'd just say "number of bytes processed so far" instead. Other than that, it seems fine. I'm sure we could add more features, but it seems like a good start - I plan to get this committed once I get a patch fixing the docs issues. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
út 5. 1. 2021 v 0:46 odesílatel Tomas Vondra <tomas.vondra@enterprisedb.com> napsal: > > Hi, > > I did take a quick look today, and I have a couple minor comments: Hi! Thanks for your time. > 1) The catalog sgml docs seem to mention bytes_processed twice (one of > that should be bytes_total), and line_processed (should be "lines_"). Fixed in attached patch. > > 2) I'm not quite sure about not including any info about the command. > For example pg_stat_progress_create_index includes info about the > command, although it's not very detailed. Not sure how useful would it > be to show just COPY FROM / COPY TO, without more details. > > It's probably possible to extract this from pg_stat_activity, but that > may be rather cumbersome (parsing arbitrary SQL and all that). Also, > what if the COPY is called from a function etc.? Any idea where to discuss this? My usecase is really simple. I would like to be able to see progress of COPY command by pid. There is a lot of COPY info inside CopyToStateData and CopyFromStateData structs. The only limitation I see is support of only int64 values for progress reporting columns. I'm not sure if it is easily possible to expose for example filename this way. > > 3) I wonder if we should have something like lines_estimated. For COPY > TO it's pretty simple to calculate it as > > bytes_total / (bytes_processed / lines_processed) > > but what about > > COPY (query) TO file > > In that case we don't know the total amount of bytes / rows, so we can't > calculate any estimate. So maybe we could peek into the query plan? But > I agree this is something we can add later. If I remember well one of the original ideas was to be able to pass custom bytes_total (or lines_total) by COPY options to be stored in copy progress. I can add this in some following patch if still welcomed. For estimates I would prefer to add an additional column to not mix those two together (or at least boolean estimated = true/false and reuse bytes_total column). If query estimates are welcomed, I can take a look at how to reach the query plan and expose those numbers when the query is used to estimated_lines and potentially estimated_bytes columns. It would be probably a little tricky to calculate estimated_bytes for some column types. Also currently only COPY FROM file supports bytes_total (it is 0 for all other scenarios). I think it should be possible for PostgreSQL to know the actual amount of lines query returns for some kind of queries, but I have no idea where to look at this. If that's possible to get, it would be one of the next steps to introduce additional column lines_total. > > 4) This comment is a bit confusing, as it mixes "total" and "processed". > I'd just say "number of bytes processed so far" instead. > Fixed in attached patch. > > Other than that, it seems fine. I'm sure we could add more features, but > it seems like a good start - I plan to get this committed once I get a > patch fixing the docs issues. Patch is attached, it should be applied to the top of the previous patch. Overall patch (having both patches merged together) could be found at https://patch-diff.githubusercontent.com/raw/simi/postgres/pull/6.patch. > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
Attachment
I'm attaching the whole patch since commitfest failed to ingest the last incremental on CI. út 5. 1. 2021 v 2:32 odesílatel Josef Šimánek <josef.simanek@gmail.com> napsal: > > út 5. 1. 2021 v 0:46 odesílatel Tomas Vondra > <tomas.vondra@enterprisedb.com> napsal: > > > > Hi, > > > > I did take a quick look today, and I have a couple minor comments: > > Hi! Thanks for your time. > > > 1) The catalog sgml docs seem to mention bytes_processed twice (one of > > that should be bytes_total), and line_processed (should be "lines_"). > > Fixed in attached patch. > > > > > 2) I'm not quite sure about not including any info about the command. > > For example pg_stat_progress_create_index includes info about the > > command, although it's not very detailed. Not sure how useful would it > > be to show just COPY FROM / COPY TO, without more details. > > > > It's probably possible to extract this from pg_stat_activity, but that > > may be rather cumbersome (parsing arbitrary SQL and all that). Also, > > what if the COPY is called from a function etc.? > > Any idea where to discuss this? My usecase is really simple. I would > like to be able to see progress of COPY command by pid. There is a lot > of COPY info inside CopyToStateData and CopyFromStateData structs. The > only limitation I see is support of only int64 values for progress > reporting columns. I'm not sure if it is easily possible to expose for > example filename this way. > > > > > 3) I wonder if we should have something like lines_estimated. For COPY > > TO it's pretty simple to calculate it as > > > > bytes_total / (bytes_processed / lines_processed) > > > > but what about > > > > COPY (query) TO file > > > > In that case we don't know the total amount of bytes / rows, so we can't > > calculate any estimate. So maybe we could peek into the query plan? But > > I agree this is something we can add later. > > If I remember well one of the original ideas was to be able to pass > custom bytes_total (or lines_total) by COPY options to be stored in > copy progress. I can add this in some following patch if still > welcomed. > > For estimates I would prefer to add an additional column to not mix > those two together (or at least boolean estimated = true/false and > reuse bytes_total column). If query estimates are welcomed, I can take > a look at how to reach the query plan and expose those numbers when > the query is used to estimated_lines and potentially estimated_bytes > columns. It would be probably a little tricky to calculate > estimated_bytes for some column types. > > Also currently only COPY FROM file supports bytes_total (it is 0 for > all other scenarios). I think it should be possible for PostgreSQL to > know the actual amount of lines query returns for some kind of > queries, but I have no idea where to look at this. If that's possible > to get, it would be one of the next steps to introduce additional > column lines_total. > > > > > 4) This comment is a bit confusing, as it mixes "total" and "processed". > > I'd just say "number of bytes processed so far" instead. > > > Fixed in attached patch. > > > > Other than that, it seems fine. I'm sure we could add more features, but > > it seems like a good start - I plan to get this committed once I get a > > patch fixing the docs issues. > > Patch is attached, it should be applied to the top of the previous > patch. Overall patch (having both patches merged together) could be > found at https://patch-diff.githubusercontent.com/raw/simi/postgres/pull/6.patch. > > > regards > > > > -- > > Tomas Vondra > > EnterpriseDB: http://www.enterprisedb.com > > The Enterprise PostgreSQL Company
Attachment
On Fri, 1 Jan 2021 at 02:25, Josef Šimánek <josef.simanek@gmail.com> wrote:
>
> Hello,
>
> finally I had some time to revisit patch and all comments from
> https://www.postgresql.org/message-id/CAFp7QwqMGEi4OyyaLEK9DR0%2BE%2BoK3UtA4bEjDVCa4bNkwUY2PQ%40mail.gmail.com
> and I have prepared simple version of COPY command progress reporting.
+1
> To keep the patch small as possible, I have introduced only a minimum
> set of columns. It could be extended later if needed.
Seems reasonable. One potential extention I could think of is progress reporting of tuples processed before/after applying the COPY's WHERE clause, when and where applicable.
> Columns are inspired by CREATE INDEX progress report system view.
>
> pid - integer - PID of backend
> datid - oid - OID of related database
> datname - name - name of related database (this seems redundant, since
> oid should be enough, but it is the same in CREATE INDEX)
> relid - oid - oid of table related to COPY command, when not known
> (for example when copying to file, it is 0)
> bytes_processed - bigint - amount of bytes processed
> bytes_total - bigint - file size in bytes if COPY FROM file (0 if not
> COPY FROM file)
> lines_processed - bigint - amount of tuples processed
Could you update the name of this column to be consistent with the 'tuples'-terminology used in the other progress reporting views, i.e. lines_processed -> tuples_processed? That adds consistency, and also disambiguates the meaning in case of e.g. COPY FROM (format CSV), as multiline CSV fields do exist, and we're not necessarily counting lines from or to a file.
>
> Hello,
>
> finally I had some time to revisit patch and all comments from
> https://www.postgresql.org/message-id/CAFp7QwqMGEi4OyyaLEK9DR0%2BE%2BoK3UtA4bEjDVCa4bNkwUY2PQ%40mail.gmail.com
> and I have prepared simple version of COPY command progress reporting.
+1
> To keep the patch small as possible, I have introduced only a minimum
> set of columns. It could be extended later if needed.
Seems reasonable. One potential extention I could think of is progress reporting of tuples processed before/after applying the COPY's WHERE clause, when and where applicable.
> Columns are inspired by CREATE INDEX progress report system view.
>
> pid - integer - PID of backend
> datid - oid - OID of related database
> datname - name - name of related database (this seems redundant, since
> oid should be enough, but it is the same in CREATE INDEX)
> relid - oid - oid of table related to COPY command, when not known
> (for example when copying to file, it is 0)
> bytes_processed - bigint - amount of bytes processed
> bytes_total - bigint - file size in bytes if COPY FROM file (0 if not
> COPY FROM file)
> lines_processed - bigint - amount of tuples processed
Could you update the name of this column to be consistent with the 'tuples'-terminology used in the other progress reporting views, i.e. lines_processed -> tuples_processed? That adds consistency, and also disambiguates the meaning in case of e.g. COPY FROM (format CSV), as multiline CSV fields do exist, and we're not necessarily counting lines from or to a file.
With regards,
Matthias van de Meent
On 1/5/21 11:02 AM, Josef Šimánek wrote: > I'm attaching the whole patch since commitfest failed to ingest the > last incremental on CI. > Yeah, the whole patch needs to be attached for the commitfest tester to work correctly - it can't apply pieces from multiple messages, etc. Anyway, I pushed this last version of patch, after a couple more tweaks, mainly to the docs - one place used pg_stat_copy_progress, the section was not indexed properly, and so on. I see Matthias proposed to change "lines" to "tuples" - I only saw the message after pushing, but I probably wouldn't make that change anyway. The CSV docs seem to talk about lines, newlines etc. so it seems fine. If not, we can change that. One more question, though - I now realize the lines_processed ignores rows skipped because of BEFORE INSERT triggers. I wonder if that's the right thing to do? Imagine you know the number of lines in a file. You can't really use (lines_processed / total_lines) to measure progress, because that may ignore many "skipped" rows. So maybe this should be changed to count all rows. OTOH we still have bytes_processed. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
st 6. 1. 2021 v 22:44 odesílatel Tomas Vondra <tomas.vondra@enterprisedb.com> napsal: > > On 1/5/21 11:02 AM, Josef Šimánek wrote: > > I'm attaching the whole patch since commitfest failed to ingest the > > last incremental on CI. > > > > Yeah, the whole patch needs to be attached for the commitfest tester to > work correctly - it can't apply pieces from multiple messages, etc. > > Anyway, I pushed this last version of patch, after a couple more tweaks, > mainly to the docs - one place used pg_stat_copy_progress, the section > was not indexed properly, and so on. > > I see Matthias proposed to change "lines" to "tuples" - I only saw the > message after pushing, but I probably wouldn't make that change anyway. > The CSV docs seem to talk about lines, newlines etc. so it seems fine. > If not, we can change that. > > One more question, though - I now realize the lines_processed ignores > rows skipped because of BEFORE INSERT triggers. I wonder if that's the > right thing to do? Imagine you know the number of lines in a file. You > can't really use (lines_processed / total_lines) to measure progress, > because that may ignore many "skipped" rows. So maybe this should be > changed to count all rows. OTOH we still have bytes_processed. I think that should be fixed. It is called "lines_processed" not "lines_inserted". I'll take a look. > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
On Thu, Jan 7, 2021 at 3:15 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 1/5/21 11:02 AM, Josef Šimánek wrote: > > I'm attaching the whole patch since commitfest failed to ingest the > > last incremental on CI. > > > > Yeah, the whole patch needs to be attached for the commitfest tester to > work correctly - it can't apply pieces from multiple messages, etc. > > Anyway, I pushed this last version of patch, after a couple more tweaks, > mainly to the docs - one place used pg_stat_copy_progress, the section > was not indexed properly, and so on. > How about including a column for command similar to pg_stat_progress_create_index and pg_stat_progress_cluster? It seems that command will be useful in the context of COPY as there are many variants of COPY. -- With Regards, Amit Kapila.
čt 7. 1. 2021 v 14:08 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal: > > On Thu, Jan 7, 2021 at 3:15 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > On 1/5/21 11:02 AM, Josef Šimánek wrote: > > > I'm attaching the whole patch since commitfest failed to ingest the > > > last incremental on CI. > > > > > > > Yeah, the whole patch needs to be attached for the commitfest tester to > > work correctly - it can't apply pieces from multiple messages, etc. > > > > Anyway, I pushed this last version of patch, after a couple more tweaks, > > mainly to the docs - one place used pg_stat_copy_progress, the section > > was not indexed properly, and so on. > > > > How about including a column for command similar to > pg_stat_progress_create_index and pg_stat_progress_cluster? It seems > that command will be useful in the context of COPY as there are many > variants of COPY. > From pg_stat_progress_create_index docs: The command that is running: CREATE INDEX, CREATE INDEX CONCURRENTLY, REINDEX, or REINDEX CONCURRENTLY. Which values should be present in copy progress? Just COPY FROM or COPY TO? > With Regards, > Amit Kapila.
On 1/7/21 12:06 PM, Josef Šimánek wrote: > st 6. 1. 2021 v 22:44 odesílatel Tomas Vondra > <tomas.vondra@enterprisedb.com> napsal: >> >> On 1/5/21 11:02 AM, Josef Šimánek wrote: >>> I'm attaching the whole patch since commitfest failed to ingest the >>> last incremental on CI. >>> >> >> Yeah, the whole patch needs to be attached for the commitfest tester to >> work correctly - it can't apply pieces from multiple messages, etc. >> >> Anyway, I pushed this last version of patch, after a couple more tweaks, >> mainly to the docs - one place used pg_stat_copy_progress, the section >> was not indexed properly, and so on. >> >> I see Matthias proposed to change "lines" to "tuples" - I only saw the >> message after pushing, but I probably wouldn't make that change anyway. >> The CSV docs seem to talk about lines, newlines etc. so it seems fine. >> If not, we can change that. >> >> One more question, though - I now realize the lines_processed ignores >> rows skipped because of BEFORE INSERT triggers. I wonder if that's the >> right thing to do? Imagine you know the number of lines in a file. You >> can't really use (lines_processed / total_lines) to measure progress, >> because that may ignore many "skipped" rows. So maybe this should be >> changed to count all rows. OTOH we still have bytes_processed. > > I think that should be fixed. It is called "lines_processed" not > "lines_inserted". I'll take a look. > So we may either rename the column to "lines_inserted", or tweak the code to count all processed lines. Or track both and have two columns. regarss -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
čt 7. 1. 2021 v 16:54 odesílatel Tomas Vondra <tomas.vondra@enterprisedb.com> napsal: > > > > On 1/7/21 12:06 PM, Josef Šimánek wrote: > > st 6. 1. 2021 v 22:44 odesílatel Tomas Vondra > > <tomas.vondra@enterprisedb.com> napsal: > >> > >> On 1/5/21 11:02 AM, Josef Šimánek wrote: > >>> I'm attaching the whole patch since commitfest failed to ingest the > >>> last incremental on CI. > >>> > >> > >> Yeah, the whole patch needs to be attached for the commitfest tester to > >> work correctly - it can't apply pieces from multiple messages, etc. > >> > >> Anyway, I pushed this last version of patch, after a couple more tweaks, > >> mainly to the docs - one place used pg_stat_copy_progress, the section > >> was not indexed properly, and so on. > >> > >> I see Matthias proposed to change "lines" to "tuples" - I only saw the > >> message after pushing, but I probably wouldn't make that change anyway. > >> The CSV docs seem to talk about lines, newlines etc. so it seems fine. > >> If not, we can change that. > >> > >> One more question, though - I now realize the lines_processed ignores > >> rows skipped because of BEFORE INSERT triggers. I wonder if that's the > >> right thing to do? Imagine you know the number of lines in a file. You > >> can't really use (lines_processed / total_lines) to measure progress, > >> because that may ignore many "skipped" rows. So maybe this should be > >> changed to count all rows. OTOH we still have bytes_processed. > > > > I think that should be fixed. It is called "lines_processed" not > > "lines_inserted". I'll take a look. > > > > So we may either rename the column to "lines_inserted", or tweak the > code to count all processed lines. Or track both and have two columns. First I'll ensure lines_processed represents the actual amount of processed lines. If reading from file and some lines are skipped due to before insert trigger, I consider that one processed as well, even if it is not inserted. If welcomed, I can add lines_inserted later as well. But I'm not sure about the use case. Also thanks to mentioning triggers, I think those could be used to test the COPY progress (at least some variants). I'll check if I would be able to add some test cases as well. > regarss > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
On 1/7/21 2:17 AM, Justin Pryzby wrote: > On Wed, Jan 06, 2021 at 10:44:49PM +0100, Tomas Vondra wrote: >> On 1/5/21 11:02 AM, Josef Šimánek wrote: >>> I'm attaching the whole patch since commitfest failed to ingest the >>> last incremental on CI. >>> >> >> Yeah, the whole patch needs to be attached for the commitfest tester to work >> correctly - it can't apply pieces from multiple messages, etc. >> >> Anyway, I pushed this last version of patch, after a couple more tweaks, >> mainly to the docs - one place used pg_stat_copy_progress, the section was >> not indexed properly, and so on. > > Some more doc fixes. > Thanks, pushed. -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, 6 Jan 2021 at 22:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 1/5/21 11:02 AM, Josef Šimánek wrote: > > I'm attaching the whole patch since commitfest failed to ingest the > > last incremental on CI. > > > > Yeah, the whole patch needs to be attached for the commitfest tester to > work correctly - it can't apply pieces from multiple messages, etc. > > Anyway, I pushed this last version of patch, after a couple more tweaks, > mainly to the docs - one place used pg_stat_copy_progress, the section > was not indexed properly, and so on. Thank you all, I'd love to use this in the future to keep track of (e.g.) my backup/restore progress. For my previously mentioned extension to keep track of filtered tuples that are excluded by the WHERE-clause, PFA patch 0001 that keeps track of that, in line with the current column name style of lines. If so desired, I'll split this off into a different thread & CF entry. > I see Matthias proposed to change "lines" to "tuples" - I only saw the > message after pushing, but I probably wouldn't make that change anyway. > The CSV docs seem to talk about lines, newlines etc. so it seems fine. > If not, we can change that. The CSV docs, sure. But copy doesn't only process CSVs; it also has text (which does have a # lines = # tuples / rows guarantee) and binary (in which the 'line' vocabulary doesn't make sense, and in which the 'tuples' vocabulary is used). Additionally, most mentions of postgres' logical rows/tuples in the COPY documentation use the 'rows' terminology ('tuples' for FORMAT BINARY), and use 'line' only for external format's textual representation's strings delimited by newlines (which I believe is not exactly what we're counting). One common user of COPY is the pg_dump tool, and that uses binary format by default (iirc). Additionally, all comments surrounding the *LINES_PROCESSED updates only mention 'tuples', so I'd like to strongly suggest (a variant of) attached patch 0002 to keep the vocabulary consistent by using 'tuples' instead of 'lines'. With regards, Matthias van de Meent
Attachment
čt 7. 1. 2021 v 19:51 odesílatel Matthias van de Meent <boekewurm+postgres@gmail.com> napsal: > > On Wed, 6 Jan 2021 at 22:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > > On 1/5/21 11:02 AM, Josef Šimánek wrote: > > > I'm attaching the whole patch since commitfest failed to ingest the > > > last incremental on CI. > > > > > > > Yeah, the whole patch needs to be attached for the commitfest tester to > > work correctly - it can't apply pieces from multiple messages, etc. > > > > Anyway, I pushed this last version of patch, after a couple more tweaks, > > mainly to the docs - one place used pg_stat_copy_progress, the section > > was not indexed properly, and so on. > > Thank you all, I'd love to use this in the future to keep track of > (e.g.) my backup/restore progress. > > For my previously mentioned extension to keep track of filtered tuples > that are excluded by the WHERE-clause, PFA patch 0001 that keeps track > of that, in line with the current column name style of lines. If I understand it well, this column could be used on COPY TO to track skipped lines because of BEFORE TRIGGER, right? I can include this in my following patch keeping lines_processed incremented even for skipped lines as well. > If so desired, I'll split this off into a different thread & CF entry. > > > I see Matthias proposed to change "lines" to "tuples" - I only saw the > > message after pushing, but I probably wouldn't make that change anyway. > > The CSV docs seem to talk about lines, newlines etc. so it seems fine. > > If not, we can change that. > > The CSV docs, sure. But copy doesn't only process CSVs; it also has > text (which does have a # lines = # tuples / rows guarantee) and > binary (in which the 'line' vocabulary doesn't make sense, and in > which the 'tuples' vocabulary is used). Additionally, most mentions of > postgres' logical rows/tuples in the COPY documentation use the 'rows' > terminology ('tuples' for FORMAT BINARY), and use 'line' only for > external format's textual representation's strings delimited by > newlines (which I believe is not exactly what we're counting). > > One common user of COPY is the pg_dump tool, and that uses binary > format by default (iirc). > > Additionally, all comments surrounding the *LINES_PROCESSED updates > only mention 'tuples', so I'd like to strongly suggest (a variant of) > attached patch 0002 to keep the vocabulary consistent by using > 'tuples' instead of 'lines'. > > > With regards, > > Matthias van de Meent
On 1/7/21 7:56 PM, Josef Šimánek wrote: > čt 7. 1. 2021 v 19:51 odesílatel Matthias van de Meent > <boekewurm+postgres@gmail.com> napsal: >> >> On Wed, 6 Jan 2021 at 22:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: >>> >>> On 1/5/21 11:02 AM, Josef Šimánek wrote: >>>> I'm attaching the whole patch since commitfest failed to ingest the >>>> last incremental on CI. >>>> >>> >>> Yeah, the whole patch needs to be attached for the commitfest tester to >>> work correctly - it can't apply pieces from multiple messages, etc. >>> >>> Anyway, I pushed this last version of patch, after a couple more tweaks, >>> mainly to the docs - one place used pg_stat_copy_progress, the section >>> was not indexed properly, and so on. >> >> Thank you all, I'd love to use this in the future to keep track of >> (e.g.) my backup/restore progress. >> >> For my previously mentioned extension to keep track of filtered tuples >> that are excluded by the WHERE-clause, PFA patch 0001 that keeps track >> of that, in line with the current column name style of lines. > > If I understand it well, this column could be used on COPY TO to track > skipped lines because of BEFORE TRIGGER, right? I can include this in > my following patch keeping lines_processed incremented even for > skipped lines as well. > That's my understanding too. >> If so desired, I'll split this off into a different thread & CF entry. >> >>> I see Matthias proposed to change "lines" to "tuples" - I only saw the >>> message after pushing, but I probably wouldn't make that change anyway. >>> The CSV docs seem to talk about lines, newlines etc. so it seems fine. >>> If not, we can change that. >> >> The CSV docs, sure. But copy doesn't only process CSVs; it also has >> text (which does have a # lines = # tuples / rows guarantee) and >> binary (in which the 'line' vocabulary doesn't make sense, and in >> which the 'tuples' vocabulary is used). Additionally, most mentions of >> postgres' logical rows/tuples in the COPY documentation use the 'rows' >> terminology ('tuples' for FORMAT BINARY), and use 'line' only for >> external format's textual representation's strings delimited by >> newlines (which I believe is not exactly what we're counting). >> >> One common user of COPY is the pg_dump tool, and that uses binary >> format by default (iirc). >> >> Additionally, all comments surrounding the *LINES_PROCESSED updates >> only mention 'tuples', so I'd like to strongly suggest (a variant of) >> attached patch 0002 to keep the vocabulary consistent by using >> 'tuples' instead of 'lines'. >> I'm not particularly attached to the "lines" naming, it just seemed OK to me. So if there's consensus to rename this somehow, I'm OK with it. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
čt 7. 1. 2021 v 22:37 odesílatel Tomas Vondra <tomas.vondra@enterprisedb.com> napsal: > > > > On 1/7/21 7:56 PM, Josef Šimánek wrote: > > čt 7. 1. 2021 v 19:51 odesílatel Matthias van de Meent > > <boekewurm+postgres@gmail.com> napsal: > >> > >> On Wed, 6 Jan 2021 at 22:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > >>> > >>> On 1/5/21 11:02 AM, Josef Šimánek wrote: > >>>> I'm attaching the whole patch since commitfest failed to ingest the > >>>> last incremental on CI. > >>>> > >>> > >>> Yeah, the whole patch needs to be attached for the commitfest tester to > >>> work correctly - it can't apply pieces from multiple messages, etc. > >>> > >>> Anyway, I pushed this last version of patch, after a couple more tweaks, > >>> mainly to the docs - one place used pg_stat_copy_progress, the section > >>> was not indexed properly, and so on. > >> > >> Thank you all, I'd love to use this in the future to keep track of > >> (e.g.) my backup/restore progress. > >> > >> For my previously mentioned extension to keep track of filtered tuples > >> that are excluded by the WHERE-clause, PFA patch 0001 that keeps track > >> of that, in line with the current column name style of lines. > > > > If I understand it well, this column could be used on COPY TO to track > > skipped lines because of BEFORE TRIGGER, right? I can include this in > > my following patch keeping lines_processed incremented even for > > skipped lines as well. > > > > That's my understanding too. > > >> If so desired, I'll split this off into a different thread & CF entry. > >> > >>> I see Matthias proposed to change "lines" to "tuples" - I only saw the > >>> message after pushing, but I probably wouldn't make that change anyway. > >>> The CSV docs seem to talk about lines, newlines etc. so it seems fine. > >>> If not, we can change that. > >> > >> The CSV docs, sure. But copy doesn't only process CSVs; it also has > >> text (which does have a # lines = # tuples / rows guarantee) and > >> binary (in which the 'line' vocabulary doesn't make sense, and in > >> which the 'tuples' vocabulary is used). Additionally, most mentions of > >> postgres' logical rows/tuples in the COPY documentation use the 'rows' > >> terminology ('tuples' for FORMAT BINARY), and use 'line' only for > >> external format's textual representation's strings delimited by > >> newlines (which I believe is not exactly what we're counting). > >> > >> One common user of COPY is the pg_dump tool, and that uses binary > >> format by default (iirc). > >> > >> Additionally, all comments surrounding the *LINES_PROCESSED updates > >> only mention 'tuples', so I'd like to strongly suggest (a variant of) > >> attached patch 0002 to keep the vocabulary consistent by using > >> 'tuples' instead of 'lines'. > >> > > I'm not particularly attached to the "lines" naming, it just seemed OK > to me. So if there's consensus to rename this somehow, I'm OK with it. The problem I do see here is it depends on the "way" of COPY. If you're copying from CSV file to table, those are actually lines (since 1 line = 1 tuple). But copying from DB to file is copying tuples (but 1 tuple = 1 file line). Line works better here for me personally. Once I'll fix the problem with triggers (and also another cases if found), I think we can consider it lines. It will represent amount of lines processed from file on COPY FROM and amount of lines written to file in COPY TO form (at least in CSV format). I'm not sure how BINARY format works, I'll check. > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
On Thu, Jan 7, 2021 at 7:02 PM Josef Šimánek <josef.simanek@gmail.com> wrote: > > čt 7. 1. 2021 v 14:08 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal: > > > > On Thu, Jan 7, 2021 at 3:15 AM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > > > > > > On 1/5/21 11:02 AM, Josef Šimánek wrote: > > > > I'm attaching the whole patch since commitfest failed to ingest the > > > > last incremental on CI. > > > > > > > > > > Yeah, the whole patch needs to be attached for the commitfest tester to > > > work correctly - it can't apply pieces from multiple messages, etc. > > > > > > Anyway, I pushed this last version of patch, after a couple more tweaks, > > > mainly to the docs - one place used pg_stat_copy_progress, the section > > > was not indexed properly, and so on. > > > > > > > How about including a column for command similar to > > pg_stat_progress_create_index and pg_stat_progress_cluster? It seems > > that command will be useful in the context of COPY as there are many > > variants of COPY. > > > > From pg_stat_progress_create_index docs: > > The command that is running: CREATE INDEX, CREATE INDEX CONCURRENTLY, > REINDEX, or REINDEX CONCURRENTLY. > > Which values should be present in copy progress? Just COPY FROM or COPY TO? > Can't we display the entire COPY command? I checked that pg_stat_statements display the query so there shouldn't be a problem to display the entire command. -- With Regards, Amit Kapila.
pá 8. 1. 2021 v 3:55 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal: > > On Thu, Jan 7, 2021 at 7:02 PM Josef Šimánek <josef.simanek@gmail.com> wrote: > > > > čt 7. 1. 2021 v 14:08 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal: > > > > > > On Thu, Jan 7, 2021 at 3:15 AM Tomas Vondra > > > <tomas.vondra@enterprisedb.com> wrote: > > > > > > > > On 1/5/21 11:02 AM, Josef Šimánek wrote: > > > > > I'm attaching the whole patch since commitfest failed to ingest the > > > > > last incremental on CI. > > > > > > > > > > > > > Yeah, the whole patch needs to be attached for the commitfest tester to > > > > work correctly - it can't apply pieces from multiple messages, etc. > > > > > > > > Anyway, I pushed this last version of patch, after a couple more tweaks, > > > > mainly to the docs - one place used pg_stat_copy_progress, the section > > > > was not indexed properly, and so on. > > > > > > > > > > How about including a column for command similar to > > > pg_stat_progress_create_index and pg_stat_progress_cluster? It seems > > > that command will be useful in the context of COPY as there are many > > > variants of COPY. > > > > > > > From pg_stat_progress_create_index docs: > > > > The command that is running: CREATE INDEX, CREATE INDEX CONCURRENTLY, > > REINDEX, or REINDEX CONCURRENTLY. > > > > Which values should be present in copy progress? Just COPY FROM or COPY TO? > > > > Can't we display the entire COPY command? I checked that > pg_stat_statements display the query so there shouldn't be a problem > to display the entire command. In previous discussions there was mentioned it doesn't make sense since you can join with pg_stat_statements on the pid column if needed. What would be the reason to duplicate that info? > -- > With Regards, > Amit Kapila.
On Fri, Jan 8, 2021 at 8:42 AM Josef Šimánek <josef.simanek@gmail.com> wrote: > > pá 8. 1. 2021 v 3:55 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal: > > > > > > Can't we display the entire COPY command? I checked that > > pg_stat_statements display the query so there shouldn't be a problem > > to display the entire command. > > In previous discussions there was mentioned it doesn't make sense > since you can join with pg_stat_statements on the pid column if > needed. What would be the reason to duplicate that info? > But pg_stat_staments won't be present by default. Also, the same argument could be applied for the command to be present in stat_progress views. It occurred to me only when I was trying to compare what we display in all the progress views. I think there is some merit in being consistent. -- With Regards, Amit Kapila.
pá 8. 1. 2021 v 5:03 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal: > > On Fri, Jan 8, 2021 at 8:42 AM Josef Šimánek <josef.simanek@gmail.com> wrote: > > > > pá 8. 1. 2021 v 3:55 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal: > > > > > > > > > Can't we display the entire COPY command? I checked that > > > pg_stat_statements display the query so there shouldn't be a problem > > > to display the entire command. > > > > In previous discussions there was mentioned it doesn't make sense > > since you can join with pg_stat_statements on the pid column if > > needed. What would be the reason to duplicate that info? > > > > But pg_stat_staments won't be present by default. Also, the same > argument could be applied for the command to be present in > stat_progress views. It occurred to me only when I was trying to > compare what we display in all the progress views. I think there is > some merit in being consistent. Sorry, I mean pg_stat_activity (not pg_stat_statements). That is included by default (at least in my installations). > -- > With Regards, > Amit Kapila.
On Fri, Jan 8, 2021 at 9:45 AM Josef Šimánek <josef.simanek@gmail.com> wrote: > > pá 8. 1. 2021 v 5:03 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal: > > > > On Fri, Jan 8, 2021 at 8:42 AM Josef Šimánek <josef.simanek@gmail.com> wrote: > > > > > > pá 8. 1. 2021 v 3:55 odesílatel Amit Kapila <amit.kapila16@gmail.com> napsal: > > > > > > > > > > > > Can't we display the entire COPY command? I checked that > > > > pg_stat_statements display the query so there shouldn't be a problem > > > > to display the entire command. > > > > > > In previous discussions there was mentioned it doesn't make sense > > > since you can join with pg_stat_statements on the pid column if > > > needed. What would be the reason to duplicate that info? > > > > > > > But pg_stat_staments won't be present by default. Also, the same > > argument could be applied for the command to be present in > > stat_progress views. It occurred to me only when I was trying to > > compare what we display in all the progress views. I think there is > > some merit in being consistent. > > Sorry, I mean pg_stat_activity (not pg_stat_statements). That is > included by default (at least in my installations). > Okay, that makes sense but I still wonder why we display it in other stat_progress views? -- With Regards, Amit Kapila.
On Thu, 7 Jan 2021 at 23:00, Josef Šimánek <josef.simanek@gmail.com> wrote: > > čt 7. 1. 2021 v 22:37 odesílatel Tomas Vondra > <tomas.vondra@enterprisedb.com> napsal: > > > > I'm not particularly attached to the "lines" naming, it just seemed OK > > to me. So if there's consensus to rename this somehow, I'm OK with it. > > The problem I do see here is it depends on the "way" of COPY. If > you're copying from CSV file to table, those are actually lines (since > 1 line = 1 tuple). But copying from DB to file is copying tuples (but > 1 tuple = 1 file line). Line works better here for me personally. > > Once I'll fix the problem with triggers (and also another cases if > found), I think we can consider it lines. It will represent amount of > lines processed from file on COPY FROM and amount of lines written to > file in COPY TO form (at least in CSV format). I'm not sure how BINARY > format works, I'll check. Counterexample that 1 tuple need not be 1 line, in csv/binary: /* * create a table with one tuple containing 1 text field, which consists of * 10 newline characters. * If you want windows-style lines, replace '\x0A' (\n) with '\x0D0A' (\r\n). */ # CREATE TABLE ttab (val) AS SELECT * FROM (values ( repeat(convert_from(E'\x0A'::bytea, 'UTF8'), 10)::text )) as v; # -- indeed, one unix-style line, according to $ wc -l copy.txt # COPY ttab TO 'copy.txt' (format text); COPY 1 # TRUNCATE ttab; COPY ttab FROM 'copy.txt' (format text); COPY 1 # -- 11 lines # COPY ttab TO 'copy.csv' (format csv); COPY 1 # TRUNCATE ttab; COPY ttab FROM 'copy.csv' (format csv); COPY 1 # -- 13 lines # COPY ttab TO 'copy.bin' (format binary); COPY 1 # TRUNCATE ttab; COPY ttab FROM 'copy.bin' (format binary); COPY 1 All of the above copy statements would only report 'lines_processed = 1', in the progress reporting, while csv/binary line counts are definatively inconsistent with what the progress reporting shows, because progress reporting counts tuples / table rows, not the amount of lines in the external file.
On Fri, Jan 8, 2021 at 7:00 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > On Thu, 7 Jan 2021 at 23:00, Josef Šimánek <josef.simanek@gmail.com> wrote: > > > > čt 7. 1. 2021 v 22:37 odesílatel Tomas Vondra > > <tomas.vondra@enterprisedb.com> napsal: > > > > > > I'm not particularly attached to the "lines" naming, it just seemed OK > > > to me. So if there's consensus to rename this somehow, I'm OK with it. > > > > The problem I do see here is it depends on the "way" of COPY. If > > you're copying from CSV file to table, those are actually lines (since > > 1 line = 1 tuple). But copying from DB to file is copying tuples (but > > 1 tuple = 1 file line). Line works better here for me personally. > > > > Once I'll fix the problem with triggers (and also another cases if > > found), I think we can consider it lines. It will represent amount of > > lines processed from file on COPY FROM and amount of lines written to > > file in COPY TO form (at least in CSV format). I'm not sure how BINARY > > format works, I'll check. > > Counterexample that 1 tuple need not be 1 line, in csv/binary: Yes in binary format file, 1 tuple is definitely not 1 line. > /* > * create a table with one tuple containing 1 text field, which consists of > * 10 newline characters. > * If you want windows-style lines, replace '\x0A' (\n) with '\x0D0A' (\r\n). > */ > # CREATE TABLE ttab (val) AS > SELECT * FROM (values ( > repeat(convert_from(E'\x0A'::bytea, 'UTF8'), 10)::text > )) as v; > > # -- indeed, one unix-style line, according to $ wc -l copy.txt > # COPY ttab TO 'copy.txt' (format text); > COPY 1 > # TRUNCATE ttab; COPY ttab FROM 'copy.txt' (format text); > COPY 1 > > # -- 11 lines > # COPY ttab TO 'copy.csv' (format csv); > COPY 1 > # TRUNCATE ttab; COPY ttab FROM 'copy.csv' (format csv); > COPY 1 > > # -- 13 lines > # COPY ttab TO 'copy.bin' (format binary); > COPY 1 > # TRUNCATE ttab; COPY ttab FROM 'copy.bin' (format binary); > COPY 1 > > All of the above copy statements would only report 'lines_processed = 1', > in the progress reporting, while csv/binary line counts are definatively > inconsistent with what the progress reporting shows, because progress > reporting counts tuples / table rows, not the amount of lines in the > external file. I agree with that. +1 to change the name of 'lines_processed' column to 'tuples_processed'. So, the meaning of 'tuples_processed' can be - for COPY FROM, it's the number of tuples that are written to the target table, for COPY TO, it's the number of tuples from the source table or source plan that are written out to the file/stdin. IMHO, it's good to have at least the following info: the type of command COPY TO or COPY FROM, the format CSV,BINARY,TEXT and from where the input comes from or output goes to - FILE, STDIN, STDOUT or PROGRAM. Otherwise users will have to join pg_stat_progress_copy view with other views to get the type of copy command or the entire query. That's sometimes cumbersome to figure out what's the other view name and write join queries. Another point related to documentation, I see that the pg_stat_progress_copy view details are added to monitoring.sgml, but I didn't find any mention of it in the copy.sgml. For instance, we have pg_stat_progress_basebackup documentation both in monitoring.sgml and a mention of it and link to it in pg_basebackup.sgml. Usually, users will look at copy documentation to find any kind of information related to it, so better we have it mentioned there as well. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com