New COPY options: DELIMITER NONE and QUOTE NONE - Mailing list pgsql-hackers

From Joel Jacobson
Subject New COPY options: DELIMITER NONE and QUOTE NONE
Date
Msg-id 748b7bfc-9f16-4bea-b7e1-4a59dc9354f5@app.fastmail.com
Whole thread Raw
Responses Re: New COPY options: DELIMITER NONE and QUOTE NONE
List pgsql-hackers
The thread "Should CSV parsing be stricter about mid-field quotes?" [1] forked
into a new topic, with two new ideas, hence this new thread.

1. COPY ... QUOTE NONE

In the [1] thread, Andrew Dunstan suggested a trick on how to deal with
unquoted but delimited files, such as TSV-files produced by Google Sheets:

> You can use CSV mode pretty reliably for TSV files.
> The trick is to use a quoting char that shouldn't appear,
> such as E'\x01' as well as setting the delimiter to E'\t'.
> Yes, it's far from obvious.

Would it be an improvement to allow specifying `QUOTE NONE` instead?

quotes.tsv:
id quote
1 "E = mc^2" -- Albert Einstein

COPY quotes FROM '/tmp/quotes.tsv' WITH CSV HEADER DELIMITER E'\t' QUOTE NONE;

SELECT * FROM quotes;
id |             quote
----+-------------------------------
  1 | "E = mc^2" -- Albert Einstein
(1 row)

2. COPY ... DELIMITER NONE

This is meant to improve the use-case when wanting to import e.g. an
unstructured log file line-by-line into a single column.

The current trick I've been using is similar to the first one,
that is, to specify a non-existing delimiter. But that involves having to find
some non-existing byte, which is error-prone since future log files might
suddenly start to contain it. So I think it would be better to be to be explicit
about not wanting to delimit fields at all, treating the entire whole line as a column.

Example:

% cat /tmp/apache.log
192.168.1.1 - - [19/May/2023:09:54:17 -0700] "GET /index.html HTTP/1.1" 200 431 "http://www.example.com/home.html" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
192.168.1.2 - - [19/May/2023:09:55:12 -0700] "POST /form.php HTTP/1.1" 200 512 "http://www.example.com/form.html" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"

CREATE TABLE unstructured_log (whole_line text);
COPY unstructured_log FROM '/tmp/apache.log' WITH CSV DELIMITER NONE QUOTE NONE;
SELECT * FROM unstructured_log;
                                                                                                               whole_line
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
192.168.1.1 - - [19/May/2023:09:54:17 -0700] "GET /index.html HTTP/1.1" 200 431 "http://www.example.com/home.html" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
192.168.1.2 - - [19/May/2023:09:55:12 -0700] "POST /form.php HTTP/1.1" 200 512 "http://www.example.com/form.html" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
(2 rows)

I hacked together a broken patch just to demonstrate the idea on syntax
and basic idea. The `COPY ... FROM` examples above works.
But it doesn't work at all for `COPY ... TO`, since it output \0 byte as
delimiter and quote in the output, which is of course not what we want.

Just wanted some feedback to see if there is any broader interest in this,
before proceeding and looking into how to implement it properly.

Is this something we want or are there just a few of us who have needed this in the past?

/Joel

Attachment

pgsql-hackers by date:

Previous
From: "Drouvot, Bertrand"
Date:
Subject: Re: PG 16 draft release notes ready
Next
From: Alvaro Herrera
Date:
Subject: Re: very long record lines in expanded psql output