Thread: Bulk Load Ignore/Skip Feature

Bulk Load Ignore/Skip Feature

From
Willem Buitendyk
Date:
Will Postgresql ever implement an ignore on error feature when bulk
loading data?  Currently it is my understanding that any record that
violates a unique constraint will cause the "copy from" command to halt
execution instead of skipping over the violation and logging it - as is
done in Oracle and DB2.
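
For what it's worth, here is a minimal reproduction of what I mean (the table
name and input file are made up; dups.txt just contains the value 1 on two
lines):

test=# create table t (id integer primary key);
test=# \copy t from 'dups.txt'
ERROR:  duplicate key violates unique constraint "t_pkey"
CONTEXT:  COPY t, line 2: "1"
test=# select count(*) from t;
 count
-------
     0
(1 row)

The whole COPY is rolled back, not just the offending line.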

Are there alternative ways of dealing with this scenario that won't
consume as much time?

Appreciate any help  - would love to migrate away from Oracle.

Cheers



Re: Bulk Load Ignore/Skip Feature

From
Reg Me Please
Date:
On Wednesday 14 November 2007 05:50:36, Willem Buitendyk wrote:
> Will Postgresql ever implement an ignore on error feature when bulk
> loading data?  Currently it is my understanding that any record that
> violates a unique constraint will cause the "copy from" command to halt
> execution instead of skipping over the violation and logging it - as is
> done in Oracle and DB2.
>
> Are there alternative ways of dealing with this scenario that won't
> consume as much time?
>
> Appreciate any help  - would love to migrate away from Oracle.
>
> Cheers

pgloader

http://pgfoundry.org/projects/pgloader/

--
Reg me Please

Re: Bulk Load Ignore/Skip Feature

From
Willem Buitendyk
Date:
Perfect - that appears to be exactly what I was looking for.

Cheers

Reg Me Please wrote:
> On Wednesday 14 November 2007 05:50:36, Willem Buitendyk wrote:
>
>> Will Postgresql ever implement an ignore on error feature when bulk
>> loading data?  Currently it is my understanding that any record that
>> violates a unique constraint will cause the "copy from" command to halt
>> execution instead of skipping over the violation and logging it - as is
>> done in Oracle and DB2.
>>
>> Are there alternative ways of dealing with this scenario that won't
>> consume as much time?
>>
>> Appreciate any help  - would love to migrate away from Oracle.
>>
>> Cheers
>>
>
> pgloader
>
> http://pgfoundry.org/projects/pgloader/
>
>


Re: Bulk Load Ignore/Skip Feature

From
David Fetter
Date:
On Tue, Nov 13, 2007 at 08:50:36PM -0800, Willem Buitendyk wrote:
> Will Postgresql ever implement an ignore on error feature when bulk
> loading data?  Currently it is my understanding that any record that
> violates a unique constraint will cause the "copy from" command to
> halt execution instead of skipping over the violation and logging it
> - as is done in Oracle and DB2.
>
> Are there alternative ways of dealing with this scenario that won't
> consume as much time?
>
> Appreciate any help  - would love to migrate away from Oracle.

You might try pgloader. :)

http://pgfoundry.org/projects/pgloader/

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: Bulk Load Ignore/Skip Feature

From
Willem Buitendyk
Date:
Damn - so the unique constraint is still an issue.  What gives?  Why is
it so hard to implement this in PostgreSQL?  Sigh - if only I had more time.


Ow Mun Heng wrote:
> On Wed, 2007-11-14 at 00:02 -0800, Willem Buitendyk wrote:
>
>> Perfect - that appears to be exactly what I was looking for.
>>
>
>
>> Reg Me Please wrote:
>>
>>> On Wednesday 14 November 2007 05:50:36, Willem Buitendyk wrote:
>>>
>>>
>>>> Will Postgresql ever implement an ignore on error feature when bulk
>>>> loading data?  Currently it is my understanding that any record that
>>>> violates a unique constraint will cause the "copy from" command to halt
>>>> execution instead of skipping over the violation and logging it - as is
>>>> done in Oracle and DB2.
>>>>
>>> pgloader
>>>
>>> http://pgfoundry.org/projects/pgloader/
>>>
>>>
>>>
>
> I believe the last time I tried this, there were still some issues with
> it. See attached email. (if it makes it to the list)
>
> ------------------------------------------------------------------------
>
> Subject:
> PgLoader unable to handle pkey dups Was [Re: {Spam} [GENERAL] pgloader
> - Can't find textreader/csvreader]
> From:
> Ow Mun Heng <Ow.Mun.Heng@wdc.com>
> Date:
> Mon, 27 Aug 2007 18:01:54 +0800
> To:
> Dimitri Fontaine <dfontaine@hi-media.com>
> CC:
> pgsql-general@postgresql.org
>
>
> On Mon, 2007-08-27 at 11:27 +0200, Dimitri Fontaine wrote:
>
>
>> We've just made some tests here with 2.2.1 and as this release contains the
>> missing files, it works fine without any installation.
>>
>
> Yep.. I can confirm that it works.. I am using the csv example.
>
> Goal: functionality similar to mysql's mysqlimport --replace
> (overwrite any rows which have duplicate primary keys)
>
> $ psql pgloader < csv/csv.sql
> $ ../pgloader.py -Tvc examples/pgloader.conf csv
>
> pgloader=# alter table csv add primary key (a,b,c);
> pgloader=# \d csv
>         Table "public.csv"
>  Column |     Type     | Modifiers
> --------+--------------+-----------
>  a      | bigint       | not null
>  b      | bigint       | not null
>  c      | character(2) | not null
>  d      | text         |
> Indexes:
>     "csv_pkey" PRIMARY KEY, btree (a, b, c)
>
> pgloader=# select * from csv;
>     a     |    b     | c  |       d
> ----------+----------+----+----------------
>  33996344 | 33996351 | GB | United Kingdom
>  50331648 | 68257567 | US | United States
>  68257568 | 68257599 | CA | Canada
>  68257600 | 68259583 | US | United States
>  68259584 | 68259599 | CA | Canada
>
> $cat csv/csv.data
> "2.6.190.56","2.6.190.63","33996344","33996351","GB","Error Kingdom"
> "4.17.143.0","4.17.143.15","68259584","68259599","CA","new Country"
> (Note: only columns 3 to 6 are taken for loading)
>
> $ psql pgloader < csv/csv.sql
> $ ../pgloader.py -vc pgloader.conf csv
> Using pgloader.conf configuration file
> Will consider following sections:
>   csv
>
> [csv] parse configuration
> Notice: reject log in /tmp/csv.rej.log
> Notice: rejected data in /tmp/csv.rej
> [csv] data import
> Notice: COPY csv data
>
> Error: Please check PostgreSQL logs
> HINT:  double check your client_encoding, datestyle and copy_delimiter
> settings
>
> $sudo tail -f /var/log/pglog/postgresxx-xx-xx.log
> ERROR:  duplicate key violates unique constraint "csv_pkey"
> CONTEXT:  COPY csv, line 1: "33996344,33996351,Error Kingdom,GB"
> STATEMENT:  COPY csv (a, b, d, c)  FROM stdin USING DELIMITERS ','
>
> So.. doesn't really solve my issue.
> Dang it..
>
>


Re: Bulk Load Ignore/Skip Feature

From
Martijn van Oosterhout
Date:
On Thu, Nov 15, 2007 at 08:09:46PM -0800, Willem Buitendyk wrote:
> Damn - so the unique constraint is still an issue.  What gives?  Why is
> it so hard to implement this in PostgreSQL?  Sigh - if only I had more time.

Can you explain? The server of course still generates error messages in
the logs, there's no way around that. However, it looks to me as though the
data ended up in the database correctly? Or did I miss something?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy


Re: Bulk Load Ignore/Skip Feature

From
Willem Buitendyk
Date:
My apologies.  I misinterpreted that last post.  I have not been able to
try pgloader as I am using the Windows platform.

Martijn van Oosterhout wrote:
> On Thu, Nov 15, 2007 at 08:09:46PM -0800, Willem Buitendyk wrote:
>
>> Damn - so the unique constraint is still an issue.  What gives?  Why is
>> it so hard to implement this in PostgreSQL?  Sigh - if only I had more time.
>>
>
> Can you explain? The server of course still generates error messages in
> the logs, there's no way around that. However, it looks to me as though the
> data ended up in the database correctly? Or did I miss something?
>
> Have a nice day,
>


Re: Bulk Load Ignore/Skip Feature

From
Dimitri Fontaine
Date:
Hi all,

On Friday 16 November 2007 18:04:44, Willem Buitendyk wrote:
> Martijn van Oosterhout wrote:
> > On Thu, Nov 15, 2007 at 08:09:46PM -0800, Willem Buitendyk wrote:
> >> Damn - so the unique constraint is still an issue.  What gives?  Why is
> >> it so hard to implement this in PostgreSQL?  Sigh - if only I had more
> >> time.
> >
> > Can you explain? The server of course still generates error messages in
> > the logs, there's no way around that. However, it looks to me as though the
> > data ended up in the database correctly? Or did I miss something?

pgloader will load the non-conflicting data and produce both a reject log with
the errors about the rows that were not inserted (COPYed) and a reject data file
with the corresponding input lines, ready to be processed again if that is the
operator's choice.
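
With the paths from Ow's transcript, that cycle looks roughly like this (only a
sketch; how you point the configuration at the corrected file is left to the
operator):

$ cat /tmp/csv.rej.log                     # why each line was rejected
$ cat /tmp/csv.rej                         # the rejected input lines, ready to be fixed
$ ../pgloader.py -vc pgloader.conf csv     # re-run the section once the data is corrected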

But at the moment it does not provide any way to automatically UPDATE the rows
that conflict on the primary key. I'm really hesitant about coding this option:
what should be done in the case of a conflict on a non-primary-key unique
constraint:

dim=# create table unic(a integer unique);
dim=# insert into unic values (1);
INSERT 2312559 1
dim=# insert into unic values (1);
ERROR:  duplicate key violates unique constraint "unic_a_key"

dim=# create table pk(a integer primary key);
dim=# insert into pk values (1);
INSERT 2312565 1
dim=# insert into pk values (1);
ERROR:  duplicate key violates unique constraint "pk_pkey"

I'm thinking that maybe in the first case you don't want existing values to be
overwritten, but in the second case that is exactly what you want to happen.
Should it be the user's responsibility to make the distinction --- by configuring
pgloader properly --- or should the tool try hard to protect the user against
himself?

And how should pgloader act on a table with both a surrogate primary key and a
unique constraint, when you want to automatically update the surrogate key but
not the unique data, or the other way around?
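
To make the alternatives concrete, what the UPDATE option would boil down to in
plain SQL is something along these lines (only a sketch, reusing the csv table
from Ow's example and a made-up staging table):

-- stage everything in a table with the same columns but no primary key
create temp table csv_stage (like csv);
-- (fill csv_stage with COPY, or with pgloader itself: no constraint, nothing rejected)

-- overwrite the rows whose key already exists in the target ...
update csv
   set d = s.d
  from csv_stage s
 where (csv.a, csv.b, csv.c) = (s.a, s.b, s.c);

-- ... and insert the genuinely new ones
insert into csv (a, b, c, d)
select a, b, c, d
  from csv_stage s
 where not exists (select 1 from csv
                    where (csv.a, csv.b, csv.c) = (s.a, s.b, s.c));

Whether the UPDATE step should run at all is exactly the question above: for a
non-primary-key unique constraint you may well not want the overwrite.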

So I have two questions for the community:
 - should I provide a pgloader mailing list?
 - what do you think about adding the UPDATE-on-duplicate-key-error option?

> My apologies.  I misinterpreted that last post.  I have not been able to
> try pgloader as I am using the Windows platform.

pgloader is a Python "script" which depends on psycopg for handling the
PostgreSQL connection, and only on standard Python modules after that. The
following link provides Windows binaries for psycopg:
  http://www.stickpeople.com/projects/python/win-psycopg/

I've had reports of pgloader running on Windows, even though I didn't make any
specific effort for that to happen and I don't have any proprietary licensed OS
to test pgloader on.
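
If it turns out pgloader itself won't run for you, the core idea is easy enough
to approximate in a few lines of Python with psycopg (purely an illustration,
not pgloader code; the column positions match Ow's csv.data example):

import csv
import psycopg2

conn = psycopg2.connect("dbname=pgloader")
cur = conn.cursor()

rejected = open('/tmp/csv.rej', 'w')
for row in csv.reader(open('csv/csv.data')):
    # columns 3 to 6 of the file map to a, b, c, d
    a, b, c, d = int(row[2]), int(row[3]), row[4], row[5]
    try:
        cur.execute("insert into csv (a, b, c, d) values (%s, %s, %s, %s)",
                    (a, b, c, d))
        conn.commit()
    except psycopg2.IntegrityError:
        # duplicate key: roll back this row, log it, keep going
        conn.rollback()
        rejected.write(','.join(row) + '\n')
rejected.close()
conn.close()

Committing one row at a time is of course much slower than COPY, which is why
pgloader drives COPY instead.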

Hope this helps,
--
dim

Re: Bulk Load Ignore/Skip Feature

From
Ow Mun Heng
Date:
On Wed, 2007-11-14 at 00:02 -0800, Willem Buitendyk wrote:
> Perfect - that appears to be exactly what I was looking for.

> Reg Me Please wrote:
> > On Wednesday 14 November 2007 05:50:36, Willem Buitendyk wrote:
> >
> >> Will Postgresql ever implement an ignore on error feature when bulk
> >> loading data?  Currently it is my understanding that any record that
> >> violates a unique constraint will cause the "copy from" command to halt
> >> execution instead of skipping over the violation and logging it - as is
> >> done in Oracle and DB2.
> >
> > pgloader
> >
> > http://pgfoundry.org/projects/pgloader/
> >
> >

I believe the last time I tried this, there were still some issues with
it. See attached email. (if it makes it to the list)

On Mon, 2007-08-27 at 11:27 +0200, Dimitri Fontaine wrote:

> We've just made some tests here with 2.2.1 and as this release contains the
> missing files, it works fine without any installation.

Yep.. I can confirm that it works.. I am using the csv example.

Goal: functionality similar to mysql's mysqlimport --replace
(overwrite any rows which have duplicate primary keys)

$ psql pgloader < csv/csv.sql
$ ../pgloader.py -Tvc examples/pgloader.conf csv

pgloader=# alter table csv add primary key (a,b,c);
pgloader=# \d csv
        Table "public.csv"
 Column |     Type     | Modifiers
--------+--------------+-----------
 a      | bigint       | not null
 b      | bigint       | not null
 c      | character(2) | not null
 d      | text         |
Indexes:
    "csv_pkey" PRIMARY KEY, btree (a, b, c)

pgloader=# select * from csv;
    a     |    b     | c  |       d
----------+----------+----+----------------
 33996344 | 33996351 | GB | United Kingdom
 50331648 | 68257567 | US | United States
 68257568 | 68257599 | CA | Canada
 68257600 | 68259583 | US | United States
 68259584 | 68259599 | CA | Canada

$cat csv/csv.data
"2.6.190.56","2.6.190.63","33996344","33996351","GB","Error Kingdom"
"4.17.143.0","4.17.143.15","68259584","68259599","CA","new Country"
(Note: only columns 3 to 6 are taken for loading)

$ psql pgloader < csv/csv.sql
$ ../pgloader.py -vc pgloader.conf csv
Using pgloader.conf configuration file
Will consider following sections:
  csv

[csv] parse configuration
Notice: reject log in /tmp/csv.rej.log
Notice: rejected data in /tmp/csv.rej
[csv] data import
Notice: COPY csv data

Error: Please check PostgreSQL logs
HINT:  double check your client_encoding, datestyle and copy_delimiter
settings

$sudo tail -f /var/log/pglog/postgresxx-xx-xx.log
ERROR:  duplicate key violates unique constraint "csv_pkey"
CONTEXT:  COPY csv, line 1: "33996344,33996351,Error Kingdom,GB"
STATEMENT:  COPY csv (a, b, d, c)  FROM stdin USING DELIMITERS ','

So.. doesn't really solve my issue.
Dang it..

