Thread: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'
Hi,
From time to time, I need to deal with bizarrely delimited text files,
having to use tools such as awk/sed/perl to reformat the files
so that they can be copied into PostgreSQL.
If such files could be imported to a table with a single text column,
we could then use PostgreSQL's now really fast regex-engine
to do data cleaning/reformatting, instead of having to rely on external awk-ward tools.
Furthermore, sometimes you don't want to clean/reformat the data at all,
but simply import the text lines "as is" without modifications,
such as when wanting to import unformatted log files,
where the log lines can contain any characters.
Could it be an idea to exploit the fact that DELIMITER E'\n' is currently an error?
ERROR: COPY delimiter cannot be newline or carriage return
That is, to change E'\n' to be a valid delimiter, which would simply read each line
delimited by newlines, as a single column.
The hack I'm currently abusing is to find some one-byte character that is not present anywhere in the text file,
and then to use that character as a delimiter. This doesn't work when needing to deal with a text file
which content is unknown at the time when writing the code though, so it's mostly useful for throwaway one-off queries.
Thoughts?
/Joel
MySQL seems to already support using \n as a delimiter (I haven't verified it myself though) [1]
Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'
From
"David G. Johnston"
Date:
On Wed, May 5, 2021 at 8:31 AM Joel Jacobson <joel@compiler.org> wrote:
Could it be an idea to exploit the fact that DELIMITER E'\n' is currently an error?
Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly what it says and does exactly what you desire?
David J.
On 05/05/21 13:02, David G. Johnston wrote: > Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly > what it says and does exactly what you desire? What would it mean? That you get one column, multiple rows of text corresponding to "lines" delimited by something, or that you get one column, one row of text for the entire content of the file? Regards, -Chap
On Wed, 5 May 2021 at 13:23, Chapman Flack <chap@anastigmatix.net> wrote:
On 05/05/21 13:02, David G. Johnston wrote:
> Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly
> what it says and does exactly what you desire?
What would it mean? That you get one column, multiple rows of text
corresponding to "lines" delimited by something, or that you get one
column, one row of text for the entire content of the file?
It means no column delimiter. In other words, there is no character which marks the end of a data value, so the entire line is a single data value.
Would DELIMITER NULL make sense? The existing values are literal strings so NULL fits with that. Do we already have NONE as a keyword somewhere? It's listed in the keyword appendix to the documentation but I can't think of where it is used off the top of my head.
Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'
From
"David G. Johnston"
Date:
On Wed, May 5, 2021 at 10:34 AM Isaac Morland <isaac.morland@gmail.com> wrote:
On Wed, 5 May 2021 at 13:23, Chapman Flack <chap@anastigmatix.net> wrote:On 05/05/21 13:02, David G. Johnston wrote:
> Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly
> what it says and does exactly what you desire?
What would it mean? That you get one column, multiple rows of text
corresponding to "lines" delimited by something, or that you get one
column, one row of text for the entire content of the file?It means no column delimiter. In other words, there is no character which marks the end of a data value, so the entire line is a single data value.
This. When dealing with COPY it's expected that each line becomes its own row. On the server you can do pg_read_file() if you need the entire file to be considered a single value. psql (\I and variables) is a bit more hackey, but I'd rather see that improved directly anyway if the goal is to try and make getting the "whole document" easier - copy isn't the right API for that IMO.
David J.
Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'
From
"Joel Jacobson"
Date:
On Wed, May 5, 2021, at 19:58, David G. Johnston wrote:
On Wed, May 5, 2021 at 10:34 AM Isaac Morland <isaac.morland@gmail.com> wrote:On Wed, 5 May 2021 at 13:23, Chapman Flack <chap@anastigmatix.net> wrote:On 05/05/21 13:02, David G. Johnston wrote:> Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly> what it says and does exactly what you desire?What would it mean? That you get one column, multiple rows of textcorresponding to "lines" delimited by something, or that you get onecolumn, one row of text for the entire content of the file?It means no column delimiter. In other words, there is no character which marks the end of a data value, so the entire line is a single data value.This. When dealing with COPY it's expected that each line becomes its own row. On the server you can do pg_read_file() if you need the entire file to be considered a single value. psql (\I and variables) is a bit more hackey, but I'd rather see that improved directly anyway if the goal is to try and make getting the "whole document" easier - copy isn't the right API for that IMO.
I think you misunderstood the problem.
I don't want the entire file to be considered a single value.
I want each line to become its own row, just a row with a single column.
So I actually think COPY seems like a perfect match for the job,
since it does precisely that, except there is no delimiter in this case.
I'm currently using the pg_read_file()-hack in a project,
and even though it can read files up to 1GB,
using e.g. regexp_split_to_table() to split on E'\n'
seems to need 4x as much memory, so it only
works with files less than ~256MB.
SELECT COUNT(*) FROM regexp_split_to_table(repeat(E'\n',1000000000),E'\n');
ERROR: invalid memory alloc request size 4000000004
Time: 4151.374 ms (00:04.151)
/Joel
Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'
From
"Joel Jacobson"
Date:
On Wed, May 5, 2021, at 19:34, Isaac Morland wrote:
Would DELIMITER NULL make sense? The existing values are literal strings so NULL fits with that. Do we already have NONE as a keyword somewhere? It's listed in the keyword appendix to the documentation but I can't think of where it is used off the top of my head.
+1 to using some keyword. NULL or NONE seems fine to me. Or maybe WITHOUT DELIMITER?
/Joel
"Joel Jacobson" <joel@compiler.org> writes: > I think you misunderstood the problem. > I don't want the entire file to be considered a single value. > I want each line to become its own row, just a row with a single column. > So I actually think COPY seems like a perfect match for the job, > since it does precisely that, except there is no delimiter in this case. Well, there's more to it than just the column delimiter. * What about \N being converted to NULL? * What about \. being treated as EOF? * Do you want to turn off the special behavior of backslash (ESCAPE) altogether? * What about newline conversions (\r\n being seen as just \n, etc)? I'm inclined to think that "use pg_read_file and then split at newlines" might be a saner answer than delving into all these fine points. Not least because people yell when you add cycles to the COPY inner loops. > I'm currently using the pg_read_file()-hack in a project, > and even though it can read files up to 1GB, > using e.g. regexp_split_to_table() to split on E'\n' > seems to need 4x as much memory, so it only > works with files less than ~256MB. Yeah, that's because of the conversion to "chr". But a regexp is overkill for that anyway. Don't we have something that will split on simple substring matches? regards, tom lane
On 5/5/21 2:45 PM, Tom Lane wrote: > "Joel Jacobson" <joel@compiler.org> writes: >> I think you misunderstood the problem. >> I don't want the entire file to be considered a single value. >> I want each line to become its own row, just a row with a single column. >> So I actually think COPY seems like a perfect match for the job, >> since it does precisely that, except there is no delimiter in this case. > Well, there's more to it than just the column delimiter. > > * What about \N being converted to NULL? > * What about \. being treated as EOF? > * Do you want to turn off the special behavior of backslash (ESCAPE) > altogether? > * What about newline conversions (\r\n being seen as just \n, etc)? > > I'm inclined to think that "use pg_read_file and then split at newlines" > might be a saner answer than delving into all these fine points. > Not least because people yell when you add cycles to the COPY > inner loops. +1 Also we have generally been resistant to supporting odd formats. FDWs can help here (e.g. file_text_array), but they can't use STDIN IIRC. > >> I'm currently using the pg_read_file()-hack in a project, >> and even though it can read files up to 1GB, >> using e.g. regexp_split_to_table() to split on E'\n' >> seems to need 4x as much memory, so it only >> works with files less than ~256MB. > Yeah, that's because of the conversion to "chr". But a regexp > is overkill for that anyway. Don't we have something that will > split on simple substring matches? > > Not that I know of. There is split_part but I don't think that's fit for purpose here. Do we need one, or have I missed something? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 5/5/21 3:36 PM, Justin Pryzby wrote: > On Wed, May 05, 2021 at 02:45:41PM -0400, Tom Lane wrote: >>> I'm currently using the pg_read_file()-hack in a project, >>> and even though it can read files up to 1GB, >>> using e.g. regexp_split_to_table() to split on E'\n' >>> seems to need 4x as much memory, so it only >>> works with files less than ~256MB. >> Yeah, that's because of the conversion to "chr". But a regexp >> is overkill for that anyway. Don't we have something that will >> split on simple substring matches? > For v14 > > commit 66f163068030b5c5fe792a0daee27822dac43791 > Author: Tom Lane <tgl@sss.pgh.pa.us> > Date: Wed Sep 2 18:23:56 2020 -0400 > > Add string_to_table() function. > Ha! just in time :-) cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 5/5/21 2:45 PM, Tom Lane wrote: >> Yeah, that's because of the conversion to "chr". But a regexp >> is overkill for that anyway. Don't we have something that will >> split on simple substring matches? > Not that I know of. There is split_part but I don't think that's fit for > purpose here. Do we need one, or have I missed something? [ checks manual ... ] string_to_array or string_to_table would do, I think. regards, tom lane
Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'
From
"Joel Jacobson"
Date:
On Wed, May 5, 2021, at 21:51, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:> On 5/5/21 2:45 PM, Tom Lane wrote:>> Yeah, that's because of the conversion to "chr". But a regexp>> is overkill for that anyway. Don't we have something that will>> split on simple substring matches?> Not that I know of. There is split_part but I don't think that's fit for> purpose here. Do we need one, or have I missed something?[ checks manual ... ]string_to_array or string_to_table would do, I think.regards, tom lane
Thanks for these new functions, they seem really useful for a lot of cases.
However, I see two problems with using string_to_table() for this particular use-case.
- Doesn't work with files larger than 1GB, due to pg_read_file()'s limit.
- 68% slower than using the COPY-hack.
% ls -lah foo.txt
-rw-r--r-- 1 joel staff 623M May 6 07:31 foo.txt
% wc -l foo.txt
6771864 foo.txt
# \d txt
Table "public.txt"
Column | Type | Collation | Nullable | Default
--------+------+-----------+----------+---------
line | text | | |
# COPY txt (line) FROM 'foo.txt' DELIMITER '"';
COPY 6771864
Time: 9829.707 ms (00:09.830)
Time: 9552.286 ms (00:09.552)
Time: 9483.115 ms (00:09.483)
# TRUNCATE txt;
TRUNCATE TABLE
# INSERT INTO txt (line) SELECT string_to_table(pg_read_file('foo.txt'),E'\n');
INSERT 0 6771865
Time: 16556.078 ms (00:16.556)
Time: 14720.343 ms (00:14.720)
Time: 17266.088 ms (00:17.266)
/Joel
Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'
From
Thomas Kellerer
Date:
Joel Jacobson schrieb am 05.05.2021 um 17:30: > Could it be an idea to exploit the fact that DELIMITER E'\n' is currently an error? > > ERROR: COPY delimiter cannot be newline or carriage return > > That is, to change E'\n' to be a valid delimiter, which would simply read each line > delimited by newlines, as a single column. > > The hack I'm currently abusing is to find some one-byte character that is not present anywhere in the text file, > and then to use that character as a delimiter. This doesn't work when needing to deal with a text file > which content is unknown at the time when writing the code though, so it's mostly useful for throwaway one-off queries. What about delimiter E'\1' The probability that a file contains the ASCII "character" 1 seems rather low. Thomas
Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'
From
"Joel Jacobson"
Date:
On Wed, May 5, 2021, at 20:45, Tom Lane wrote:
"Joel Jacobson" <joel@compiler.org> writes:> I think you misunderstood the problem.> I don't want the entire file to be considered a single value.> I want each line to become its own row, just a row with a single column.> So I actually think COPY seems like a perfect match for the job,> since it does precisely that, except there is no delimiter in this case.Well, there's more to it than just the column delimiter.* What about \N being converted to NULL?* What about \. being treated as EOF?* Do you want to turn off the special behavior of backslash (ESCAPE)altogether?* What about newline conversions (\r\n being seen as just \n, etc)?I'm inclined to think that "use pg_read_file and then split at newlines"might be a saner answer than delving into all these fine points.Not least because people yell when you add cycles to the COPYinner loops.
Thanks for providing strong arguments why the COPY approach is a dead-end, I agree.
However, as demonstrated in my previous email, using
string_to_table(pg_read_file( filename ), E'\n')
has its performance as well as max size issues.
Maybe these two problems could be solved by combining the two functions into one?
file_to_table ( filename text, delimiter text [, null_string text ] ) → setof text
I'm thinking thanks to returning "setof text", such a function could read a stream,
and return a line as soon as a delimiter is encountered, not having to keep
the entire file in memory at any time.
/Joel
Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'
From
Darafei "Komяpa" Praliaskouski
Date:
I have similar problems and what is really needed is a way to get a file from client side into a server side object that can be dealt with later.
The most popular way is COPY and it is built into the psql tool. In general it supports \copy wrapper, and there is COPY FROM STDIN. However, it is not available to the files that are not following the csv-like structure. I had to use it for XML and huge JSON files before, and it's always `sed` before the import and a replace() after.
pg_read_file does not help on cloud and managed installs of postgres here.
What I would prefer is some new COPY mode like RAW that will just push whatever it gets on the stdin/input into the cell on the server side. This way it can be proxied by psql, utilize existing infra for passing streams and be used in shell scripting.
On Thu, May 6, 2021 at 9:14 AM Joel Jacobson <joel@compiler.org> wrote:
On Wed, May 5, 2021, at 20:45, Tom Lane wrote:"Joel Jacobson" <joel@compiler.org> writes:> I think you misunderstood the problem.> I don't want the entire file to be considered a single value.> I want each line to become its own row, just a row with a single column.> So I actually think COPY seems like a perfect match for the job,> since it does precisely that, except there is no delimiter in this case.Well, there's more to it than just the column delimiter.* What about \N being converted to NULL?* What about \. being treated as EOF?* Do you want to turn off the special behavior of backslash (ESCAPE)altogether?* What about newline conversions (\r\n being seen as just \n, etc)?I'm inclined to think that "use pg_read_file and then split at newlines"might be a saner answer than delving into all these fine points.Not least because people yell when you add cycles to the COPYinner loops.Thanks for providing strong arguments why the COPY approach is a dead-end, I agree.However, as demonstrated in my previous email, usingstring_to_table(pg_read_file( filename ), E'\n')has its performance as well as max size issues.Maybe these two problems could be solved by combining the two functions into one?file_to_table ( filename text, delimiter text [, null_string text ] ) → setof textI'm thinking thanks to returning "setof text", such a function could read a stream,and return a line as soon as a delimiter is encountered, not having to keepthe entire file in memory at any time./Joel
On Thu, 6 May 2021 at 02:21, Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote:
What I would prefer is some new COPY mode like RAW that will just push whatever it gets on the stdin/input into the cell on the server side. This way it can be proxied by psql, utilize existing infra for passing streams and be used in shell scripting.
Yes! A significant missing feature is “take this arbitrary bucket of bits and move it to/from the database from/to this file without modification of any kind”. There are all sorts of tutorials on the Web about how to almost do it or fake it or convert to hex or whatever but I’ve never seen a nice simple explanation of “here is the command:”.
Of course there is a lot to think about. Coming out of the database, the query result must be exactly one row containing exactly one column; how do we handle other database results? Coming into the database, only one data value can come from a single file; so how do we populate the other columns of whatever table we copy to?
Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'
From
"Joel Jacobson"
Date:
On Thu, May 6, 2021, at 13:41, Isaac Morland wrote:
On Thu, 6 May 2021 at 02:21, Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote:What I would prefer is some new COPY mode like RAW that will just push whatever it gets on the stdin/input into the cell on the server side. This way it can be proxied by psql, utilize existing infra for passing streams and be used in shell scripting.Yes! A significant missing feature is “take this arbitrary bucket of bits and move it to/from the database from/to this file without modification of any kind”. There are all sorts of tutorials on the Web about how to almost do it or fake it or convert to hex or whatever but I’ve never seen a nice simple explanation of “here is the command:”.Of course there is a lot to think about. Coming out of the database, the query result must be exactly one row containing exactly one column; how do we handle other database results? Coming into the database, only one data value can come from a single file; so how do we populate the other columns of whatever table we copy to?
If the file is on the server, you can use pg_read_binary_file() for that purpose.
/Joel
On Thu, 6 May 2021 at 12:02, Joel Jacobson <joel@compiler.org> wrote:
On Thu, May 6, 2021, at 13:41, Isaac Morland wrote:Yes! A significant missing feature is “take this arbitrary bucket of bits and move it to/from the database from/to this file without modification of any kind”. There are all sorts of tutorials on the Web about how to almost do it or fake it or convert to hex or whatever but I’ve never seen a nice simple explanation of “here is the command:”.Of course there is a lot to think about. Coming out of the database, the query result must be exactly one row containing exactly one column; how do we handle other database results? Coming into the database, only one data value can come from a single file; so how do we populate the other columns of whatever table we copy to?If the file is on the server, you can use pg_read_binary_file() for that purpose.
Yes, sorry, I should have explicitly said “on the client” somewhere up there. Getting files from the DB server into the DB is no problem.
On 5/6/21 7:41 AM, Isaac Morland wrote: > On Thu, 6 May 2021 at 02:21, Darafei "Komяpa" Praliaskouski > <me@komzpa.net <mailto:me@komzpa.net>> wrote: > > > What I would prefer is some new COPY mode like RAW that will just > push whatever it gets on the stdin/input into the cell on the > server side. This way it can be proxied by psql, utilize existing > infra for passing streams and be used in shell scripting. > > > Yes! A significant missing feature is “take this arbitrary bucket of > bits and move it to/from the database from/to this file without > modification of any kind”. There are all sorts of tutorials on the Web > about how to almost do it or fake it or convert to hex or whatever but > I’ve never seen a nice simple explanation of “here is the command:”. > > Of course there is a lot to think about. Coming out of the database, > the query result must be exactly one row containing exactly one > column; how do we handle other database results? Coming into the > database, only one data value can come from a single file; so how do > we populate the other columns of whatever table we copy to? > What I'd like is something a bit more general. We could extend the FDW API to allow for a CopyStdin handler or some such. The input could be sent unfiltered to the handler, which would do whatever it liked with it. That way FDWs like file_fdw and file_text_array_fdw could read from stdin, for example. I'm pretty sure it would handle the OP's use case. The downside is you'd need to write an FDW handler, but that's not too hard, and there are lots of examples. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'
From
"Daniel Verite"
Date:
Darafei "Komяpa" Praliaskouski wrote: > What I would prefer is some new COPY mode like RAW that will just push > whatever it gets on the stdin/input into the cell on the server side. This > way it can be proxied by psql, utilize existing infra for passing streams > and be used in shell scripting. COPY RAW has been proposed and discussed quite a bit previously: https://commitfest.postgresql.org/12/676/ Best regards, -- Daniel Vérité PostgreSQL-powered mailer: https://www.manitou-mail.org Twitter: @DanielVerite
Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'
From
"Joel Jacobson"
Date:
What about:
COPY ... FROM ... WITH PATTERN 'regexp_pattern'
Where the columns would be matched with the capture groups.
This could handle the quite common case of varying white-space as column separators:
COPY log (col1, col2, col3) FROM 'log.txt' WITH PATTERN '^(\S+)\s+(\S+)\s+(\S+)$'
This could also handle $SUBJECT:
COPY table_name (single_column) FROM 'unknown.txt' WITH PATTERN '^(.*)$';
And lots of other more complex use-cases.
/Joel