Thread: Extend COPY FROM with HEADER to skip multiple lines

Extend COPY FROM with HEADER to skip multiple lines

From
Shinya Kato
Date:
Hi hackers,

I'd like to propose a new feature for the COPY FROM command to allow
skipping multiple header lines when loading data. This enhancement
would enable files with multi-line headers to be loaded without any
preprocessing, which would significantly improve usability.

In real-world scenarios, it's common for data files to contain
multiple header lines, such as file descriptions or column
explanations. Currently, the COPY command cannot load these files
directly, which requires users to preprocess them with tools like sed
or tail.

Although you can use "COPY t FROM PROGRAM 'tail -n +3 /path/to/file'",
some environments do not have the tail command available.
Additionally, this approach requires superuser privileges or
membership in the pg_execute_server_program role.

This feature also has precedent in other major RDBMS:
- MySQL: LOAD DATA ... IGNORE N LINES [1]
- SQL Server: BULK INSERT … WITH (FIRST ROW=N) [2]
- Oracle SQL*Loader: sqlldr … SKIP=N [3]

I have not yet created a patch, but I am willing to implement an
extension for the HEADER option. I would like to discuss the
specification first.

The specification I have in mind is as follows:
- Command: COPY FROM
- Formats: text and csv
- Option syntax: HEADER [ boolean | integer | MATCH] (Extend the
HEADER option to accept an integer value in addition to the existing
boolean and MATCH keywords.)
- Behavior: Let N be the specified integer.
  - If N < 0, raise an error.
  - If N = 0 or 1, same behavior when boolean is specified.
  - If N > 1, skip the first N rows.

Thoughts?

[1] https://dev.mysql.com/doc/refman/8.4/en/load-data.html#load-data-field-line-handling
[2]
https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver17#firstrow--first_row
[3]
https://docs.oracle.com/en/database/oracle/oracle-database/23/sutil/oracle-sql-loader-commands.html#SUTIL-GUID-84244C46-6AFD-412D-9240-BEB0B5C2718B

--
Best regards,
Shinya Kato
NTT OSS Center



Re: Extend COPY FROM with HEADER to skip multiple lines

From
Fujii Masao
Date:

On 2025/06/09 16:10, Shinya Kato wrote:
> Hi hackers,
> 
> I'd like to propose a new feature for the COPY FROM command to allow
> skipping multiple header lines when loading data. This enhancement
> would enable files with multi-line headers to be loaded without any
> preprocessing, which would significantly improve usability.
> 
> In real-world scenarios, it's common for data files to contain
> multiple header lines, such as file descriptions or column
> explanations. Currently, the COPY command cannot load these files
> directly, which requires users to preprocess them with tools like sed
> or tail.
> 
> Although you can use "COPY t FROM PROGRAM 'tail -n +3 /path/to/file'",
> some environments do not have the tail command available.
> Additionally, this approach requires superuser privileges or
> membership in the pg_execute_server_program role.
> 
> This feature also has precedent in other major RDBMS:
> - MySQL: LOAD DATA ... IGNORE N LINES [1]
> - SQL Server: BULK INSERT … WITH (FIRST ROW=N) [2]
> - Oracle SQL*Loader: sqlldr … SKIP=N [3]
> 
> I have not yet created a patch, but I am willing to implement an
> extension for the HEADER option. I would like to discuss the
> specification first.
> 
> The specification I have in mind is as follows:
> - Command: COPY FROM
> - Formats: text and csv
> - Option syntax: HEADER [ boolean | integer | MATCH] (Extend the
> HEADER option to accept an integer value in addition to the existing
> boolean and MATCH keywords.)
> - Behavior: Let N be the specified integer.
>    - If N < 0, raise an error.
>    - If N = 0 or 1, same behavior when boolean is specified.
>    - If N > 1, skip the first N rows.
> 
> Thoughts?

I generally like the idea.

However, a similar proposal was made earlier [1], and seemingly
some hackers weren't in favor of it. It's probably worth reading
that thread to understand the previous concerns.

Regards,


[1] https://postgr.es/m/CALAY4q8nGSXp0P5uf56vn-mD7reWqZP5k6PS1CGUm26X4FsYJA@mail.gmail.com

-- 
Fujii Masao
NTT DATA Japan Corporation




Re: Extend COPY FROM with HEADER to skip multiple lines

From
Shinya Kato
Date:
2025年6月9日(月) 17:27 Fujii Masao <masao.fujii@oss.nttdata.com>:
> I generally like the idea.

Thanks.

> However, a similar proposal was made earlier [1], and seemingly
> some hackers weren't in favor of it. It's probably worth reading
> that thread to understand the previous concerns.
>
> [1] https://postgr.es/m/CALAY4q8nGSXp0P5uf56vn-mD7reWqZP5k6PS1CGUm26X4FsYJA@mail.gmail.com

Oh, I missed it. I will check it soon.

--
Best regards,
Shinya Kato
NTT OSS Center



Re: Extend COPY FROM with HEADER to skip multiple lines

From
Andrew Dunstan
Date:
On 2025-06-09 Mo 4:27 AM, Fujii Masao wrote:
>
>
> On 2025/06/09 16:10, Shinya Kato wrote:
>> Hi hackers,
>>
>> I'd like to propose a new feature for the COPY FROM command to allow
>> skipping multiple header lines when loading data. This enhancement
>> would enable files with multi-line headers to be loaded without any
>> preprocessing, which would significantly improve usability.
>>
>> In real-world scenarios, it's common for data files to contain
>> multiple header lines, such as file descriptions or column
>> explanations. Currently, the COPY command cannot load these files
>> directly, which requires users to preprocess them with tools like sed
>> or tail.
>>
>> Although you can use "COPY t FROM PROGRAM 'tail -n +3 /path/to/file'",
>> some environments do not have the tail command available.
>> Additionally, this approach requires superuser privileges or
>> membership in the pg_execute_server_program role.
>>
>> This feature also has precedent in other major RDBMS:
>> - MySQL: LOAD DATA ... IGNORE N LINES [1]
>> - SQL Server: BULK INSERT … WITH (FIRST ROW=N) [2]
>> - Oracle SQL*Loader: sqlldr … SKIP=N [3]
>>
>> I have not yet created a patch, but I am willing to implement an
>> extension for the HEADER option. I would like to discuss the
>> specification first.
>>
>> The specification I have in mind is as follows:
>> - Command: COPY FROM
>> - Formats: text and csv
>> - Option syntax: HEADER [ boolean | integer | MATCH] (Extend the
>> HEADER option to accept an integer value in addition to the existing
>> boolean and MATCH keywords.)
>> - Behavior: Let N be the specified integer.
>>    - If N < 0, raise an error.
>>    - If N = 0 or 1, same behavior when boolean is specified.
>>    - If N > 1, skip the first N rows.
>>
>> Thoughts?
>
> I generally like the idea.
>
> However, a similar proposal was made earlier [1], and seemingly
> some hackers weren't in favor of it. It's probably worth reading
> that thread to understand the previous concerns.
>
> Regards,
>
>
> [1] 
> https://postgr.es/m/CALAY4q8nGSXp0P5uf56vn-mD7reWqZP5k6PS1CGUm26X4FsYJA@mail.gmail.com



I think the earlier proposal went rather further than this one, which I 
suspect can be implemented fairly cheaply.

I don't have terribly strong feelings about it, but matching a feature 
implemented elsewhere has some attraction if it can be done easily.

OTOH I'm a bit curious to know what software produces multi-line CSV 
headers.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Extend COPY FROM with HEADER to skip multiple lines

From
Shinya Kato
Date:
> > However, a similar proposal was made earlier [1], and seemingly
> > some hackers weren't in favor of it. It's probably worth reading
> > that thread to understand the previous concerns.
> >
> > [1] https://postgr.es/m/CALAY4q8nGSXp0P5uf56vn-mD7reWqZP5k6PS1CGUm26X4FsYJA@mail.gmail.com
>
> Oh, I missed it. I will check it soon.

I read it.

There are clear differences from the earlier proposal. My sole
motivation is to skip multiple headers, and I don't believe a feature
for skipping footers is necessary. To be clear, I think it's best to
simply extend the current HEADER option.

Regarding the concern about adding ETL-like functionality, this
feature is already implemented in other RDBMSs, which is why I believe
it is also necessary for PostgreSQL.

Honestly, I haven't implemented it yet, so I'm not sure about the
performance. However, I don't expect it to be significantly different
from the current HEADER option that skips a single line.


> I think the earlier proposal went rather further than this one, which I
> suspect can be implemented fairly cheaply.

That's probably it, exactly.

> I don't have terribly strong feelings about it, but matching a feature
> implemented elsewhere has some attraction if it can be done easily.
>
> OTOH I'm a bit curious to know what software produces multi-line CSV
> headers.

Both Pandas and R can create CSV files with multi-line headers
(although I don't personally think this is desirable). Furthermore,
various systems sometimes generate reports as CSV files that
unexpectedly contain multiple header lines.

-- 
Best regards,
Shinya Kato
NTT OSS Center



Re: Extend COPY FROM with HEADER to skip multiple lines

From
Fujii Masao
Date:

On 2025/06/10 9:43, Shinya Kato wrote:
>>> However, a similar proposal was made earlier [1], and seemingly
>>> some hackers weren't in favor of it. It's probably worth reading
>>> that thread to understand the previous concerns.
>>>
>>> [1] https://postgr.es/m/CALAY4q8nGSXp0P5uf56vn-mD7reWqZP5k6PS1CGUm26X4FsYJA@mail.gmail.com
>>
>> Oh, I missed it. I will check it soon.
> 
> I read it.
> 
> There are clear differences from the earlier proposal. My sole
> motivation is to skip multiple headers, and I don't believe a feature
> for skipping footers is necessary. To be clear, I think it's best to
> simply extend the current HEADER option.

Sounds ok to me.


> Regarding the concern about adding ETL-like functionality, this
> feature is already implemented in other RDBMSs, which is why I believe
> it is also necessary for PostgreSQL.
> 
> Honestly, I haven't implemented it yet, so I'm not sure about the
> performance. However, I don't expect it to be significantly different
> from the current HEADER option that skips a single line.

So it seems better for you to implement the patch at first and then
check the performance overhead etc if necessary.

Regards,

-- 
Fujii Masao
NTT DATA Japan Corporation




Re: Extend COPY FROM with HEADER to skip multiple lines

From
Shinya Kato
Date:
On Tue, Jun 10, 2025 at 2:34 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> > There are clear differences from the earlier proposal. My sole
> > motivation is to skip multiple headers, and I don't believe a feature
> > for skipping footers is necessary. To be clear, I think it's best to
> > simply extend the current HEADER option.
>
> Sounds ok to me.

Thank you.

> > Regarding the concern about adding ETL-like functionality, this
> > feature is already implemented in other RDBMSs, which is why I believe
> > it is also necessary for PostgreSQL.
> >
> > Honestly, I haven't implemented it yet, so I'm not sure about the
> > performance. However, I don't expect it to be significantly different
> > from the current HEADER option that skips a single line.
>
> So it seems better for you to implement the patch at first and then
> check the performance overhead etc if necessary.

Thank you for your advice. I will create a patch.

--
Best regards,
Shinya Kato
NTT OSS Center

On Tue, Jun 10, 2025 at 2:34 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
>
>
> On 2025/06/10 9:43, Shinya Kato wrote:
> >>> However, a similar proposal was made earlier [1], and seemingly
> >>> some hackers weren't in favor of it. It's probably worth reading
> >>> that thread to understand the previous concerns.
> >>>
> >>> [1] https://postgr.es/m/CALAY4q8nGSXp0P5uf56vn-mD7reWqZP5k6PS1CGUm26X4FsYJA@mail.gmail.com
> >>
> >> Oh, I missed it. I will check it soon.
> >
> > I read it.
> >
> > There are clear differences from the earlier proposal. My sole
> > motivation is to skip multiple headers, and I don't believe a feature
> > for skipping footers is necessary. To be clear, I think it's best to
> > simply extend the current HEADER option.
>
> Sounds ok to me.
>
>
> > Regarding the concern about adding ETL-like functionality, this
> > feature is already implemented in other RDBMSs, which is why I believe
> > it is also necessary for PostgreSQL.
> >
> > Honestly, I haven't implemented it yet, so I'm not sure about the
> > performance. However, I don't expect it to be significantly different
> > from the current HEADER option that skips a single line.
>
> So it seems better for you to implement the patch at first and then
> check the performance overhead etc if necessary.
>
> Regards,
>
> --
> Fujii Masao
> NTT DATA Japan Corporation
>


--
Best regards,
Shinya Kato
NTT OSS Center



Re: Extend COPY FROM with HEADER to skip multiple lines

From
Dagfinn Ilmari Mannsåker
Date:
Andrew Dunstan <andrew@dunslane.net> writes:

> OTOH I'm a bit curious to know what software produces multi-line CSV
> headers.

AWS CloudFront access logs are stored in S3 as TSV files (one per hour
per CF node) with a two-line header comment where the first line is the
version and the second lists the fields (but not in a form useful for
HEADER MATCH).

Although not useful for the above format, and not intended to derail or
bloat the proposal in this thread, would it be useful to have a mode
that combines skip and match?  I.e. skip N lines, then check the fields
in the one after that against the target columns.

- ilmari