Thread: importing a messy text file

importing a messy text file

From: Willy-Bas Loos
Hi,

I have a 56GB text file that I want to import into Postgres.
The file is tab delimited and not quoted.
I deleted the header with the column names (using sed) so that I could use COPY with the non-CSV text format (because some of the text values contain quotes).

I had some minor trouble with the file, which I managed to work around, but now I have a problem where I can't think of a solution, even though it seems so simple.

The problem is this:
There is a tab after the last column, in many but not all records.
When I ran into the extra tab I added a dummy column to the destination table, but now COPY throws an error because the data for the dummy column is missing (on record ~275K of about 150M).

The file is too big to edit by hand, and anyway it would probably not be feasible to manually add tabs for every record that is missing one, although I don't know how many that would be.

I realize that there could be other showstoppers in the file, like missing or extra tabs in the middle, but I would like to try and get this fixed.

Maybe it would be feasible to load every record as a single value and then split those into columns using Postgres text processing (a rough sketch of that follows below).
Or maybe there is an (undocumented?) option in COPY or \copy to ignore extra columns.
Or maybe there is some NoSQL software where I can import this and then structure the data before I pass it to Postgres.
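
(A rough sketch of that staging idea, with stand-in table and file names; the trick is to pick a COPY delimiter that never occurs in the data, and split_part then simply ignores any extra trailing field:

create table staging (line text);
-- load each raw line as a single value, e.g.:
-- psql -c "copy staging from stdin (format text, delimiter e'\x01')" mydb < bigfile.txt
-- caveat: text format still interprets backslashes in the data
insert into real_table
select nullif(split_part(line, e'\t', 1), ''),
       nullif(split_part(line, e'\t', 2), '')  -- ...one split_part per column
from staging;
)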

Do you have any tips, please?

Cheers,


--
Willy-Bas Loos

Re: importing a messy text file

From: Karsten Hilbert
On Wed, Apr 30, 2014 at 10:07:09AM +0200, Willy-Bas Loos wrote:

> it would probably not be feasible to manually add tabs for every
> record that is missing one, although I don't know how many that would be.

Why not a) let a script do that, or b) remove tabs from the end of lines as needed?

All in all it sounds like you want to use a scripting
language to sanitize the file.
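
(A minimal sketch of option b), assuming GNU sed and assuming that every line-final tab is spurious, an assumption revisited below; bigfile.txt stands in for the real file name:

sed 's/\t$//' bigfile.txt > cleaned.txt
)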

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: importing a messy text file

From: Alberto Cabello Sánchez
On Wed, 30 Apr 2014 10:07:09 +0200
Willy-Bas Loos <willybas@gmail.com> wrote:

> Hi,
>
> I have a 56GB text file that I want to import into Postgres.

!!!

> The problem is this:
> There is a tab after the last column, in many but not all records.

You could clean it up with something like s/^I$// (^I being a literal tab), couldn't you?

--
Alberto Cabello Sánchez
<alberto@unex.es>


Re: importing a messy text file

From: Willy-Bas Loos
Alberto, it would be hard to use sed (s/^I$//), because there is no explicit NULL value and there are many NULLs in the last column.
So I can't be sure how many tabs should be at the end of each line.
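
(One way around that ambiguity would be to strip a trailing tab only from lines that have exactly one field too many. A sketch, assuming a hypothetical table of 10 columns:

awk -F'\t' -v cols=10 'NF == cols + 1 && $(cols + 1) == "" { sub(/\t$/, "") } { print }' bigfile.txt > cleaned.txt

A line whose real last column is NULL still has exactly 10 fields, so it passes through untouched.)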

Yes, Karsten, maybe scripting would be the easiest way to fix this; I would then probably insert it line by line. That's a possibility. It might take a long time to run, but that's not much of an issue.

--
Willy-Bas Loos

Re: importing a messy text file

From: Alberto Cabello Sánchez
On Wed, 30 Apr 2014 10:47:12 +0200
Willy-Bas Loos <willybas@gmail.com> wrote:

> Alberto, it would be hard to use sed (s/^I$//), because there is no
> explicit NULL value and there are many NULLs in the last column.
> So I can't be sure how many tabs should be at the end of each line.

Ok, I understand (I think).

What about using "cut" to strip the extra fields?
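
(A sketch, again assuming a hypothetical 10-column table; cut's default delimiter is already the tab, and lines without the extra field pass through unchanged:

cut -f 1-10 bigfile.txt > cleaned.txt
)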

--
Alberto Cabello Sánchez
<alberto@unex.es>


Re: importing a messy text file

From: Rémi Cura
Hey,
with the latest version, 9.3, you can use COPY FROM with a program as the source.
I had a similar problem (an extra space at the end of each line), so I used sed in a pipeline to remove it before feeding the data to the database.

You can use the Unix pipe (|) for most of this stuff.

If you have an older version, you can use the mkfifo command to "trick" Postgres into believing it is copying from a file.
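
(Sketches of both, sticking with the hypothetical 10-column cut cleanup; mytable, mydb and the paths are stand-ins, and note that server-side COPY, whether FROM PROGRAM or from a file, needs superuser rights and paths valid on the server machine:

-- 9.3+:
COPY mytable FROM PROGRAM 'cut -f 1-10 /path/to/bigfile.txt' WITH (FORMAT text, NULL '');

# pre-9.3, via a named pipe the server reads as if it were a file:
mkfifo /tmp/cleaned
cut -f 1-10 /path/to/bigfile.txt > /tmp/cleaned &
psql -d mydb -c "COPY mytable FROM '/tmp/cleaned' WITH (FORMAT text, NULL '')"
)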

Cheers,
Rémi-C


2014-04-30 10:47 GMT+02:00 Willy-Bas Loos <willybas@gmail.com>:

> Alberto, it would be hard to use sed (s/^I$//), because there is no explicit NULL value and there are many NULLs in the last column.
> So I can't be sure how many tabs should be at the end of each line.
>
> Yes, Karsten, maybe scripting would be the easiest way to fix this; I would then probably insert it line by line. That's a possibility. It might take a long time to run, but that's not much of an issue.
>
> --
> Willy-Bas Loos

Re: importing a messy text file

From: Willy-Bas Loos
On Wed, Apr 30, 2014 at 11:03 AM, Alberto Cabello Sánchez <alberto@unex.es> wrote:


> What about using "cut" to strip the extra fields?

Wow, I didn't know about "cut"; this looks promising.

Thanks.

--
Willy-Bas Loos

Re: importing a messy text file

From: Willy-Bas Loos
On Wed, Apr 30, 2014 at 11:06 AM, Rémi Cura <remi.cura@gmail.com> wrote:
> with the latest version, 9.3, you can use COPY FROM with a program as the source.
> You can use the Unix pipe (|) for most of this stuff.
>
> If you have an older version, you can use the mkfifo command to "trick" Postgres into believing it is copying from a file.


Hi, you're right, I can see the new feature in the docs.
But since I am working from the (bash) command line already, I don't see any use for this.
I already pass the data like this:
cat <file> |psql -c "copy <table> from stdin NULL ''" <db> > <file>.log 2>&1

It's especially handy if you want to stay in SQL (a transaction), I guess.
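
(With the hypothetical cut cleanup spliced into that pipeline, using stand-in names for the file, table and database:

cut -f 1-10 bigfile.txt | psql -d mydb -c "copy mytable from stdin NULL ''" > bigfile.txt.log 2>&1
)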

Cheers,

--
Willy-Bas Loos

Re: importing a messy text file

From: Rob Sargentg
On 04/30/2014 03:50 AM, Willy-Bas Loos wrote:

> On Wed, Apr 30, 2014 at 11:06 AM, Rémi Cura <remi.cura@gmail.com> wrote:
>> with the latest version, 9.3, you can use COPY FROM with a program as the source.
>> You can use the Unix pipe (|) for most of this stuff.
>>
>> If you have an older version, you can use the mkfifo command to "trick" Postgres into believing it is copying from a file.
>
> Hi, you're right, I can see the new feature in the docs.
> But since I am working from the (bash) command line already, I don't see any use for this.
> I already pass the data like this:
> cat <file> |psql -c "copy <table> from stdin NULL ''" <db> > <file>.log 2>&1
>
> It's especially handy if you want to stay in SQL (a transaction), I guess.
>
> Cheers,
>
> --
> Willy-Bas Loos
Unless you know this is the only problem (an extra/missing tab), I would triage the file with a scripting language. One favourite check is:

awk -F'\t' '{a[NF]++} END {print "Field count : record count"; for (i in a) printf "%11d : %d\n", i, a[i]}' bigfile.txt

If you only have the tab problem, you will get two lines of output, with field counts N and N-1. Take care in setting the field separator (here a literal tab).

Re: importing a messy text file

From: bricklen

On Wed, Apr 30, 2014 at 1:07 AM, Willy-Bas Loos <willybas@gmail.com> wrote:

> Hi,
>
> I have a 56GB text file that I want to import into Postgres.
> The file is tab delimited and not quoted.

Would Pgloader be an option? http://tapoueh.org/pgloader/

Re: importing a messy text file

From: Michael Paquier
On Thu, May 1, 2014 at 1:30 AM, bricklen <bricklen@gmail.com> wrote:
>
> On Wed, Apr 30, 2014 at 1:07 AM, Willy-Bas Loos <willybas@gmail.com> wrote:
>>
>>
>> Hi,
>>
>> I have a 56GB text file that I want to import into Postgres.
>> The file is tab delimited and not quoted.
>
>
> Would Pgloader be an option? http://tapoueh.org/pgloader/
Or pg_bulkload? Unlike the in-core COPY, it can filter out
"incorrect" tuples at the line level.
--
Michael