Thread: insert only unique values into a table, ignore rest?

insert only unique values into a table, ignore rest?

From
George Nychis
Date:
Hi,

I have approximately 2 billion data entries that I would like to insert into a database.
Each entry consists of:
INT BOOLEAN INT BOOLEAN

I want to populate a table such that it only contains the unique rows; all other data
should be thrown out.  I would say a significant number of the insertions are going to
fail due to unique constraints.  The unique constraint is on the two integers, not on the
booleans.
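
For concreteness, the table looks something like this (the ipconn name comes from the
MySQL command below; the column names are just placeholders I'm using for this post):

  CREATE TABLE ipconn (
      src   integer,
      flag1 boolean,
      dst   integer,
      flag2 boolean,
      UNIQUE (src, dst)   -- uniqueness is on the two integers only
  );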

Using MySQL, I was able to do this with the following command, run over all the data files
(25574 data files total):
mysql -e "use connectivity; LOAD DATA INFILE '/path/to/file' IGNORE INTO TABLE ipconn
FIELDS TERMINATED BY ' ';"

What I *think* MySQL did was sort each data file and do something like a merge sort between
the data I was inserting and the data already in the database.  It would insert the first
unique instance of a row it saw, and reject all other insertions that violated the unique
constraint, thanks to the "IGNORE".

From what I understand, this functionality is not in PostgreSQL.  Fine, I certainly can't
change that.  But I am looking for a comparable solution that scales to the size of my data.

One solution is to have a temporary table, insert all 2 billion rows, and then copy the
distinct entries to another table.  This would amount to one massive sort, I assume?

Is this the only/best solution using PostgreSQL?

Thanks!
George

Re: insert only unique values into a table, ignore rest?

From
Scott Marlowe
Date:
On Mon, 2007-01-08 at 14:58, George Nychis wrote:
> Hi,
>
> I have approximately 2 billion data entries that I would like to insert into a database.
> Each entry consists of:
> INT BOOLEAN INT BOOLEAN
>
> I want to populate a table such that it only contains the unique rows, all other data
> should be thrown out.  I would say a significant amount of the insertions are going to
> fail due to unique constraints.  The unique constraint is on the two integers, not on the
> booleans.
>
> Using mysql, I was able to do this with the following query, for all data files (25574
> data files total):
> mysql -e \"use connectivity; LOAD DATA INFILE '/path/to/file' IGNORE INTO TABLE ipconn
> FIELDS TERMINATED BY ' ';\"

A quick question.  Could you run selects or other inserts on that table
while the load data infile was running?  Because I'm guessing that it
basically locked the whole table for the duration.

> What I *think* mysql did was sort each data file and do a sort of merge sort between the
> data I was inserting and the data in the database.  It would insert the first unique
> instance of a row it saw, and reject all other insertions that violated the unique
> constraint due to the "IGNORE".

My guess too.  Which would require "one big lock" on the table, which would
mean no parallel access.

It's also likely that it used a temp table, which doubled the size of the
database while you were inserting.

> From what I understand, this functionality is not in postgresql.  Fine, I certainly can't
> change that.  But I am looking for a comparable solution for the size of my data.
>
> One solution is to have a temporary table, insert all 2 billion rows, and then copy the
> distinct entries to another table.  This would be like one massive sort?
>
> Is this the only/best solution using postgresql?

TANSTAAFL ("there ain't no such thing as a free lunch").  PostgreSQL is designed so that
you can run an import process on that table while 100 other users still access it at the
same time.  Because of that, you don't get to do the dirty, nasty things under the
sheets that allow for super easy data loading and merging like you got
with MySQL.  Apples and oranges.

Assuming you're loading into an empty table, the "load into a temp table, then select
distinct into the final table" approach seems reasonable and should run
reasonably fast.  If you need to load into an existing table, it might get
a little more complex.
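
For the empty-table case, something along these lines (I'm borrowing the ipconn table from
your MySQL command and guessing at column names, so treat this strictly as a sketch):

  -- stage everything with no constraints; run the COPY once per data file
  -- (server-side path here; use psql's \copy if the files live on the client)
  CREATE TEMP TABLE ipconn_raw (src integer, flag1 boolean, dst integer, flag2 boolean);
  COPY ipconn_raw FROM '/path/to/file' WITH DELIMITER ' ';

  -- then keep one row per (src, dst) pair; which row "wins" is arbitrary
  -- unless you add an ORDER BY after the DISTINCT ON
  INSERT INTO ipconn (src, flag1, dst, flag2)
  SELECT DISTINCT ON (src, dst) src, flag1, dst, flag2
  FROM ipconn_raw;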

Re: insert only unique values into a table, ignore rest?

From
George Nychis
Date:

Scott Marlowe wrote:
> On Mon, 2007-01-08 at 14:58, George Nychis wrote:
>> Hi,
>>
>> I have approximately 2 billion data entries that I would like to insert into a database.
>> Each entry consists of:
>> INT BOOLEAN INT BOOLEAN
>>
>> I want to populate a table such that it only contains the unique rows, all other data
>> should be thrown out.  I would say a significant amount of the insertions are going to
>> fail due to unique constraints.  The unique constraint is on the two integers, not on the
>> booleans.
>>
>> Using mysql, I was able to do this with the following query, for all data files (25574
>> data files total):
>> mysql -e \"use connectivity; LOAD DATA INFILE '/path/to/file' IGNORE INTO TABLE ipconn
>> FIELDS TERMINATED BY ' ';\"
>
> A quick question.  Could you run selects or other inserts on that table
> while the load data infile was running?  Cause I'm guessing that it
> basically locked the whole table while running.
What does this have to do with my question?  I don't need to run selects or inserts on the
table while the load data is running...

>
>> What I *think* mysql did was sort each data file and do a sort of merge sort between the
>> data I was inserting and the data in the database.  It would insert the first unique
>> instance of a row it saw, and reject all other insertions that violated the unique
>> constraint due to the "IGNORE".
>
> Me too.  Which would require "one big lock" on the table which would
> mean no parallel access.
That's fine, it doesn't matter.

>
> It's also likely that it used a temp table which doubled the size of the
> database while you were inserting.
>
>> From what I understand, this functionality is not in postgresql.  Fine, I certainly can't
>> change that.  But I am looking for a comparable solution for the size of my data.
>>
>> One solution is to have a temporary table, insert all 2 billion rows, and then copy the
>> distinct entries to another table.  This would be like one massive sort?
>>
>> Is this the only/best solution using postgresql?
>
> TANSTAAFL.  PostgreSQL is designed so that you can run an import process
> on that table while 100 other users still access it at the same time.
> Because of that, you don't get to do dirty, nasty things under the
> sheets that allow for super easy data loading and merging like you got
> with MySQL.  Apples and Oranges.
>
> Assuming you're loading into an empty table, the load to temp, select
> distinct out and into the final table seems reasonable, should run
> reasonably fast.  If you need to load to an existing table, it might get
> a little more complex.
>

The goal is not to run queries while the data is being inserted.  I am wondering whether the
PostgreSQL method I mentioned, inserting everything and then keeping only the distinct values,
is the most optimal approach, i.e. one that produces the same results as the method I
described for MySQL.

Re: insert only unique values into a table, ignore rest?

From
Scott Marlowe
Date:
On Mon, 2007-01-08 at 15:52, George Nychis wrote:
> Scott Marlowe wrote:
> > On Mon, 2007-01-08 at 14:58, George Nychis wrote:
> >> Hi,
> >>
> >> I have approximately 2 billion data entries that I would like to insert into a database.
> >> Each entry consists of:
> >> INT BOOLEAN INT BOOLEAN
> >>
> >> I want to populate a table such that it only contains the unique rows, all other data
> >> should be thrown out.  I would say a significant amount of the insertions are going to
> >> fail due to unique constraints.  The unique constraint is on the two integers, not on the
> >> booleans.
> >>
> >> Using mysql, I was able to do this with the following query, for all data files (25574
> >> data files total):
> >> mysql -e \"use connectivity; LOAD DATA INFILE '/path/to/file' IGNORE INTO TABLE ipconn
> >> FIELDS TERMINATED BY ' ';\"
> >
> > A quick question.  Could you run selects or other inserts on that table
> > while the load data infile was running?  Cause I'm guessing that it
> > basically locked the whole table while running.
> What does this have to do with my question?  I don't need to run selects or inserts on the
> table while the load data is running...
>
> >
> >> What I *think* mysql did was sort each data file and do a sort of merge sort between the
> >> data I was inserting and the data in the database.  It would insert the first unique
> >> instance of a row it saw, and reject all other insertions that violated the unique
> >> constraint due to the "IGNORE".
> >
> > Me too.  Which would require "one big lock" on the table which would
> > mean no parallel access.
> Thats fine, it doesn't matter.
>
> >
> > It's also likely that it used a temp table which doubled the size of the
> > database while you were inserting.
> >
> >> From what I understand, this functionality is not in postgresql.  Fine, I certainly can't
> >> change that.  But I am looking for a comparable solution for the size of my data.
> >>
> >> One solution is to have a temporary table, insert all 2 billion rows, and then copy the
> >> distinct entries to another table.  This would be like one massive sort?
> >>
> >> Is this the only/best solution using postgresql?
> >
> > TANSTAAFL.  PostgreSQL is designed so that you can run an import process
> > on that table while 100 other users still access it at the same time.
> > Because of that, you don't get to do dirty, nasty things under the
> > sheets that allow for super easy data loading and merging like you got
> > with MySQL.  Apples and Oranges.
> >
> > Assuming you're loading into an empty table, the load to temp, select
> > distinct out and into the final table seems reasonable, should run
> > reasonably fast.  If you need to load to an existing table, it might get
> > a little more complex.
> >
>
> The goal is not to run queries while the data is being inserted....I am wondering if the
> postgresql method I have mentioned to actually insert and get only distinct values is most
> optimal, which would produce the same results method I explained in mysql.

Did I fail to answer your question?

Sorry if I gave you more information than you needed.  Please feel free
to ask someone else next time.

Re: insert only unique values into a table, ignore rest?

From
George Nychis
Date:

Scott Marlowe wrote:
> On Mon, 2007-01-08 at 15:52, George Nychis wrote:
>> Scott Marlowe wrote:
>>> On Mon, 2007-01-08 at 14:58, George Nychis wrote:
>>>> Hi,
>>>>
>>>> I have approximately 2 billion data entries that I would like to insert into a database.
>>>> Each entry consists of:
>>>> INT BOOLEAN INT BOOLEAN
>>>>
>>>> I want to populate a table such that it only contains the unique rows, all other data
>>>> should be thrown out.  I would say a significant amount of the insertions are going to
>>>> fail due to unique constraints.  The unique constraint is on the two integers, not on the
>>>> booleans.
>>>>
>>>> Using mysql, I was able to do this with the following query, for all data files (25574
>>>> data files total):
>>>> mysql -e \"use connectivity; LOAD DATA INFILE '/path/to/file' IGNORE INTO TABLE ipconn
>>>> FIELDS TERMINATED BY ' ';\"
>>> A quick question.  Could you run selects or other inserts on that table
>>> while the load data infile was running?  Cause I'm guessing that it
>>> basically locked the whole table while running.
>> What does this have to do with my question?  I don't need to run selects or inserts on the
>> table while the load data is running...
>>
>>>> What I *think* mysql did was sort each data file and do a sort of merge sort between the
>>>> data I was inserting and the data in the database.  It would insert the first unique
>>>> instance of a row it saw, and reject all other insertions that violated the unique
>>>> constraint due to the "IGNORE".
>>> Me too.  Which would require "one big lock" on the table which would
>>> mean no parallel access.
>> Thats fine, it doesn't matter.
>>
>>> It's also likely that it used a temp table which doubled the size of the
>>> database while you were inserting.
>>>
>>>> From what I understand, this functionality is not in postgresql.  Fine, I certainly can't
>>>> change that.  But I am looking for a comparable solution for the size of my data.
>>>>
>>>> One solution is to have a temporary table, insert all 2 billion rows, and then copy the
>>>> distinct entries to another table.  This would be like one massive sort?
>>>>
>>>> Is this the only/best solution using postgresql?
>>> TANSTAAFL.  PostgreSQL is designed so that you can run an import process
>>> on that table while 100 other users still access it at the same time.
>>> Because of that, you don't get to do dirty, nasty things under the
>>> sheets that allow for super easy data loading and merging like you got
>>> with MySQL.  Apples and Oranges.
>>>
>>> Assuming you're loading into an empty table, the load to temp, select
>>> distinct out and into the final table seems reasonable, should run
>>> reasonably fast.  If you need to load to an existing table, it might get
>>> a little more complex.
>>>
>> The goal is not to run queries while the data is being inserted....I am wondering if the
>> postgresql method I have mentioned to actually insert and get only distinct values is most
>> optimal, which would produce the same results method I explained in mysql.
>
> Did I fail to answer your question?
>
> Sorry if I gave you more information than you needed.  Please feel free
> to ask someone else next time.
>

Ahh, I missed your last paragraph... so much text.  Actually, yes, that answers my
question, thank you.  I guess it's more of a single run-through than the MySQL method, which
was piecewise.

Thanks for the help/response.

- George

Re: insert only unique values into a table, ignore rest?

From
Scott Marlowe
Date:
On Mon, 2007-01-08 at 15:59, George Nychis wrote:
> Scott Marlowe wrote:
> > On Mon, 2007-01-08 at 15:52, George Nychis wrote:
> >> Scott Marlowe wrote:
> >>> On Mon, 2007-01-08 at 14:58, George Nychis wrote:
> >>>> Hi,
> >>>>
> >>>> I have approximately 2 billion data entries that I would like to insert into a database.
> >>>> Each entry consists of:
> >>>> INT BOOLEAN INT BOOLEAN
> >>>>

SNIP

> >>> Assuming you're loading into an empty table, the load to temp, select
> >>> distinct out and into the final table seems reasonable, should run
> >>> reasonably fast.  If you need to load to an existing table, it might get
> >>> a little more complex.
> >>>
> >> The goal is not to run queries while the data is being inserted....I am wondering if the
> >> postgresql method I have mentioned to actually insert and get only distinct values is most
> >> optimal, which would produce the same results method I explained in mysql.
> >
> > Did I fail to answer your question?
> >
> > Sorry if I gave you more information than you needed.  Please feel free
> > to ask someone else next time.
> >
>
> ahhh i missed your last paragraph... so much text.  Actually yeah that answers my
> question, thank you.  I guess its more a single run through than the mysql method which
> was piece-wise.

Note that things will go faster if you use "copy from stdin" for the initial bulk data
load.  Individual inserts in PostgreSQL are quite costly compared to MySQL; it's the
transactional overhead.  By grouping them together you can make things much faster.
Copy from stdin does all the inserts in one big transaction.

If you use insert statements, wrap them in a begin; end; pair to make
them one transaction.  Not as fast as copy, because of the per-statement parsing, but still
much faster than individual transactions.
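
For example, roughly (reusing the staging table sketched earlier in the thread; the path
and the sample values are placeholders):

  # one big COPY per data file, fed through psql's stdin
  cat /path/to/file | psql connectivity -c "COPY ipconn_raw FROM STDIN WITH DELIMITER ' '"

  -- or, if you're stuck generating INSERT statements, batch them:
  BEGIN;
  INSERT INTO ipconn_raw VALUES (1, true, 2, false);
  INSERT INTO ipconn_raw VALUES (3, false, 4, true);
  -- ...many more rows...
  END;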

Re: insert only unique values into a table, ignore rest?

From
"Jeremy Haile"
Date:
> Note that things will go faster if you do your initial data load using
> "copy from stdin" for the initial bulk data load.  individual inserts in
> postgresql are quite costly compared to mysql.  It's the transactional
> overhead.  by grouping them together you can make things much faster.
> copy from stdin does all the inserts in one big transaction.

You could do "copy from file" as well right?  (no performance difference
compared to "copy from stdin")  I do this all the time.

Also - maybe I misunderstand something, but why does PostgreSQL's
implementation prohibit it from ignoring insert errors during a copy?
If you added a unique constraint to the table before copying, PostgreSQL
would generate errors due to the unique constraint violations - so I
don't think any additional locking would be required for it to simply
say "if there is an error while copying in, ignore it and continue
inserting the other rows".

PostgreSQL's copy command doesn't currently support this, so the temp
table followed by a distinct select is the way to go.  But I didn't
follow all of the talk about it requiring locking the table and being
inherently impossible for PostgreSQL to support.

I've wanted a similar feature.  I select rows into a table on a regular
basis.  I'd like to be able to overlap old values and have PostgreSQL
ignore the failed inserts.  SQL Server offers a flag that lets you
ignore inserts whose primary key already exists in the table.  The only
solution in PostgreSQL is to run a query that manually deletes the
duplicate rows from a temp table before inserting - which takes much
more time.
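
The manual workaround I mean looks roughly like this (the table and column names are
made up):

  -- stage the incoming rows, then drop the ones whose key already
  -- exists in the target table before doing the real insert
  DELETE FROM staging
  WHERE EXISTS (SELECT 1 FROM target
                WHERE target.id = staging.id);

  INSERT INTO target SELECT * FROM staging;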

Re: insert only unique values into a table, ignore rest?

From
George Nychis
Date:

Jeremy Haile wrote:
>> Note that things will go faster if you do your initial data load using
>> "copy from stdin" for the initial bulk data load.  individual inserts in
>> postgresql are quite costly compared to mysql.  It's the transactional
>> overhead.  by grouping them together you can make things much faster.
>> copy from stdin does all the inserts in one big transaction.
>
> You could do "copy from file" as well right?  (no performance difference
> compared to "copy from stdin")  I do this all the time.
>
> Also - maybe I misunderstand something, but why does PostgreSQL's
> implementation prohibit it from ignoring insert errors during a copy?
> If you added a unique constraint to the table before copying, PostgreSQL
> would generate errors due to the unique constraint violation - so I
> don't think any additional locking would be required for it to simply
> say "If there is an error while copying in, ignore it and continue
> inserting other rows"
>
> PostgreSQL's copy command doesn't currently support this, so the temp
> table followed by a distinct select is the way to go.  But I didn't
> follow all of the talk about it requiring locking the table and being
> inherently impossible for PostgreSQL to support.
>
> I've wanted a similar feature.  I select rows into a table on a regular
> basis.  I'd like to be able to overlap old values and have PostgreSQL
> ignore failed inserts.  SQL Server offers a flag that allows you to
> ignore inserts whose primary key already exists in the table.  The only
> solution in PostgreSQL is to run a query to manually delete the
> duplicate rows from a temp table before inserting - which takes much
> more time.
>

I would also like this feature... :) (obviously)

I also didn't exactly follow the locking discussion; I don't need locking, as far as I know.

- George