Thread: Extend COPY FROM with HEADER to skip multiple lines
Hi hackers, I'd like to propose a new feature for the COPY FROM command to allow skipping multiple header lines when loading data. This enhancement would enable files with multi-line headers to be loaded without any preprocessing, which would significantly improve usability. In real-world scenarios, it's common for data files to contain multiple header lines, such as file descriptions or column explanations. Currently, the COPY command cannot load these files directly, which requires users to preprocess them with tools like sed or tail. Although you can use "COPY t FROM PROGRAM 'tail -n +3 /path/to/file'", some environments do not have the tail command available. Additionally, this approach requires superuser privileges or membership in the pg_execute_server_program role. This feature also has precedent in other major RDBMS: - MySQL: LOAD DATA ... IGNORE N LINES [1] - SQL Server: BULK INSERT … WITH (FIRST ROW=N) [2] - Oracle SQL*Loader: sqlldr … SKIP=N [3] I have not yet created a patch, but I am willing to implement an extension for the HEADER option. I would like to discuss the specification first. The specification I have in mind is as follows: - Command: COPY FROM - Formats: text and csv - Option syntax: HEADER [ boolean | integer | MATCH] (Extend the HEADER option to accept an integer value in addition to the existing boolean and MATCH keywords.) - Behavior: Let N be the specified integer. - If N < 0, raise an error. - If N = 0 or 1, same behavior when boolean is specified. - If N > 1, skip the first N rows. Thoughts? [1] https://dev.mysql.com/doc/refman/8.4/en/load-data.html#load-data-field-line-handling [2] https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver17#firstrow--first_row [3] https://docs.oracle.com/en/database/oracle/oracle-database/23/sutil/oracle-sql-loader-commands.html#SUTIL-GUID-84244C46-6AFD-412D-9240-BEB0B5C2718B -- Best regards, Shinya Kato NTT OSS Center
On 2025/06/09 16:10, Shinya Kato wrote: > Hi hackers, > > I'd like to propose a new feature for the COPY FROM command to allow > skipping multiple header lines when loading data. This enhancement > would enable files with multi-line headers to be loaded without any > preprocessing, which would significantly improve usability. > > In real-world scenarios, it's common for data files to contain > multiple header lines, such as file descriptions or column > explanations. Currently, the COPY command cannot load these files > directly, which requires users to preprocess them with tools like sed > or tail. > > Although you can use "COPY t FROM PROGRAM 'tail -n +3 /path/to/file'", > some environments do not have the tail command available. > Additionally, this approach requires superuser privileges or > membership in the pg_execute_server_program role. > > This feature also has precedent in other major RDBMS: > - MySQL: LOAD DATA ... IGNORE N LINES [1] > - SQL Server: BULK INSERT … WITH (FIRST ROW=N) [2] > - Oracle SQL*Loader: sqlldr … SKIP=N [3] > > I have not yet created a patch, but I am willing to implement an > extension for the HEADER option. I would like to discuss the > specification first. > > The specification I have in mind is as follows: > - Command: COPY FROM > - Formats: text and csv > - Option syntax: HEADER [ boolean | integer | MATCH] (Extend the > HEADER option to accept an integer value in addition to the existing > boolean and MATCH keywords.) > - Behavior: Let N be the specified integer. > - If N < 0, raise an error. > - If N = 0 or 1, same behavior when boolean is specified. > - If N > 1, skip the first N rows. > > Thoughts? I generally like the idea. However, a similar proposal was made earlier [1], and seemingly some hackers weren't in favor of it. It's probably worth reading that thread to understand the previous concerns. Regards, [1] https://postgr.es/m/CALAY4q8nGSXp0P5uf56vn-mD7reWqZP5k6PS1CGUm26X4FsYJA@mail.gmail.com -- Fujii Masao NTT DATA Japan Corporation
2025年6月9日(月) 17:27 Fujii Masao <masao.fujii@oss.nttdata.com>: > I generally like the idea. Thanks. > However, a similar proposal was made earlier [1], and seemingly > some hackers weren't in favor of it. It's probably worth reading > that thread to understand the previous concerns. > > [1] https://postgr.es/m/CALAY4q8nGSXp0P5uf56vn-mD7reWqZP5k6PS1CGUm26X4FsYJA@mail.gmail.com Oh, I missed it. I will check it soon. -- Best regards, Shinya Kato NTT OSS Center
On 2025-06-09 Mo 4:27 AM, Fujii Masao wrote: > > > On 2025/06/09 16:10, Shinya Kato wrote: >> Hi hackers, >> >> I'd like to propose a new feature for the COPY FROM command to allow >> skipping multiple header lines when loading data. This enhancement >> would enable files with multi-line headers to be loaded without any >> preprocessing, which would significantly improve usability. >> >> In real-world scenarios, it's common for data files to contain >> multiple header lines, such as file descriptions or column >> explanations. Currently, the COPY command cannot load these files >> directly, which requires users to preprocess them with tools like sed >> or tail. >> >> Although you can use "COPY t FROM PROGRAM 'tail -n +3 /path/to/file'", >> some environments do not have the tail command available. >> Additionally, this approach requires superuser privileges or >> membership in the pg_execute_server_program role. >> >> This feature also has precedent in other major RDBMS: >> - MySQL: LOAD DATA ... IGNORE N LINES [1] >> - SQL Server: BULK INSERT … WITH (FIRST ROW=N) [2] >> - Oracle SQL*Loader: sqlldr … SKIP=N [3] >> >> I have not yet created a patch, but I am willing to implement an >> extension for the HEADER option. I would like to discuss the >> specification first. >> >> The specification I have in mind is as follows: >> - Command: COPY FROM >> - Formats: text and csv >> - Option syntax: HEADER [ boolean | integer | MATCH] (Extend the >> HEADER option to accept an integer value in addition to the existing >> boolean and MATCH keywords.) >> - Behavior: Let N be the specified integer. >> - If N < 0, raise an error. >> - If N = 0 or 1, same behavior when boolean is specified. >> - If N > 1, skip the first N rows. >> >> Thoughts? > > I generally like the idea. > > However, a similar proposal was made earlier [1], and seemingly > some hackers weren't in favor of it. It's probably worth reading > that thread to understand the previous concerns. > > Regards, > > > [1] > https://postgr.es/m/CALAY4q8nGSXp0P5uf56vn-mD7reWqZP5k6PS1CGUm26X4FsYJA@mail.gmail.com I think the earlier proposal went rather further than this one, which I suspect can be implemented fairly cheaply. I don't have terribly strong feelings about it, but matching a feature implemented elsewhere has some attraction if it can be done easily. OTOH I'm a bit curious to know what software produces multi-line CSV headers. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
> > However, a similar proposal was made earlier [1], and seemingly > > some hackers weren't in favor of it. It's probably worth reading > > that thread to understand the previous concerns. > > > > [1] https://postgr.es/m/CALAY4q8nGSXp0P5uf56vn-mD7reWqZP5k6PS1CGUm26X4FsYJA@mail.gmail.com > > Oh, I missed it. I will check it soon. I read it. There are clear differences from the earlier proposal. My sole motivation is to skip multiple headers, and I don't believe a feature for skipping footers is necessary. To be clear, I think it's best to simply extend the current HEADER option. Regarding the concern about adding ETL-like functionality, this feature is already implemented in other RDBMSs, which is why I believe it is also necessary for PostgreSQL. Honestly, I haven't implemented it yet, so I'm not sure about the performance. However, I don't expect it to be significantly different from the current HEADER option that skips a single line. > I think the earlier proposal went rather further than this one, which I > suspect can be implemented fairly cheaply. That's probably it, exactly. > I don't have terribly strong feelings about it, but matching a feature > implemented elsewhere has some attraction if it can be done easily. > > OTOH I'm a bit curious to know what software produces multi-line CSV > headers. Both Pandas and R can create CSV files with multi-line headers (although I don't personally think this is desirable). Furthermore, various systems sometimes generate reports as CSV files that unexpectedly contain multiple header lines. -- Best regards, Shinya Kato NTT OSS Center