Thread: problem with large inserts

problem with large inserts

From
Lutz Fischer
Date:
Hi

I have currently some trouble with inserts into a table

INSERT INTO LPP (PPID, LID)
SELECT DISTINCT PPid, LID FROM
        (SELECT * FROM PP WHERE s_id = sid) pp
            INNER JOIN
        has_protein hp1
            ON pp.p1id = hp1.pid
            INNER JOIN
        has_protein hp2
            ON pp.p2_id = hp2.pid
            INNER JOIN
        (SELECT * FROM L WHERE s_id = sid) l
            ON (hp1.pid = l.p1id AND hp2.pid = l.p2id AND hp1.ppos +
pp.s1 = l.s1 AND hp2.ppos + pp.s2 = l.s2)
            OR (hp1.pid = l.p2id AND hp2.pid = l.p1id AND hp1.ppos +
pp.s1 = l.s2 AND hp2.ppos + pp.s2 = l.s1)
    ;

If I run only

SELECT DISTINCT PPid, LID FROM
        (SELECT * FROM PP WHERE s_id = 708) pp
            INNER JOIN
        has_protein hp1
            ON pp.p1id = hp1.pid
            INNER JOIN
        has_protein hp2
            ON pp.p2_id = hp2.pid
            INNER JOIN
        (SELECT * FROM L WHERE s_id = 708) l
            ON (hp1.pid = l.p1id AND hp2.pid = l.p2id AND hp1.ppos +
pp.s1 = l.s1 AND hp2.ppos + pp.s2 = l.s2)
            OR (hp1.pid = l.p2id AND hp2.pid = l.p1id AND hp1.ppos +
pp.s1 = l.s2 AND hp2.ppos + pp.s2 = l.s1)
    ;

it returns 200620 rows in 170649 ms  ( thats just under 3  minutes). I
stopped the actual insert after about 8h.

The table that the insert happens to, is following:
CREATE TABLE LPP
(
  ppid bigint NOT NULL,
  lid bigint NOT NULL,
  CONSTRAINT pk_lpp PRIMARY KEY (ppid,lid)
)

I also tried without the primary key but that one is still running for
more that a day.

Currently the table LPP holds 471139 rows. Its linking the PP and the L
table.

There are no foreign keys referring to that table nor are there any
other constraints on it.
Previously I had foreign keys on lid and ppid refering to the L and PP
table. But in a desperate try to get some speed up I deleted these. -
But still...

I am running postgresql 9.2 on a windows 2008  R2 server with 256 GB and
the database is on something like a raid 1+0 (actually a raid1e)
consisting of 3x4TB disks (limit of what could easily be fitted into the
server).

At the given time  there were no concurrent access to any of the
involved tables.

Has anybody some idea why the insert takes so long and/or how to speed
things up a bit? I could live with something like half an hour - better
would be in minutes.


Thanks for any responds,

Lutz Fischer


--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: problem with large inserts

From
Willem Leenen
Date:
I would strongly discourage you from droppping the referential integrity. You risk data corruption, which will cost you a good deal of time to sort it out properly, and corruption prevents you to apply the R.I. again. Also it has hardly any performance impact.

Are the plans different? ( i guess you've looked at http://wiki.postgresql.org/wiki/Slow_Query_Questions ?)

> Date: Thu, 13 Dec 2012 15:37:33 +0000
> From: lfischer@staffmail.ed.ac.uk
> To: pgsql-performance@postgresql.org
> Subject: [PERFORM] problem with large inserts
>
> Hi
>
> I have currently some trouble with inserts into a table
>
> INSERT INTO LPP (PPID, LID)
> SELECT DISTINCT PPid, LID FROM
> (SELECT * FROM PP WHERE s_id = sid) pp
> INNER JOIN
> has_protein hp1
> ON pp.p1id = hp1.pid
> INNER JOIN
> has_protein hp2
> ON pp.p2_id = hp2.pid
> INNER JOIN
> (SELECT * FROM L WHERE s_id = sid) l
> ON (hp1.pid = l.p1id AND hp2.pid = l.p2id AND hp1.ppos +
> pp.s1 = l.s1 AND hp2.ppos + pp.s2 = l.s2)
> OR (hp1.pid = l.p2id AND hp2.pid = l.p1id AND hp1.ppos +
> pp.s1 = l.s2 AND hp2.ppos + pp.s2 = l.s1)
> ;
>
> If I run only
>
> SELECT DISTINCT PPid, LID FROM
> (SELECT * FROM PP WHERE s_id = 708) pp
> INNER JOIN
> has_protein hp1
> ON pp.p1id = hp1.pid
> INNER JOIN
> has_protein hp2
> ON pp.p2_id = hp2.pid
> INNER JOIN
> (SELECT * FROM L WHERE s_id = 708) l
> ON (hp1.pid = l.p1id AND hp2.pid = l.p2id AND hp1.ppos +
> pp.s1 = l.s1 AND hp2.ppos + pp.s2 = l.s2)
> OR (hp1.pid = l.p2id AND hp2.pid = l.p1id AND hp1.ppos +
> pp.s1 = l.s2 AND hp2.ppos + pp.s2 = l.s1)
> ;
>
> it returns 200620 rows in 170649 ms ( thats just under 3 minutes). I
> stopped the actual insert after about 8h.
>
> The table that the insert happens to, is following:
> CREATE TABLE LPP
> (
> ppid bigint NOT NULL,
> lid bigint NOT NULL,
> CONSTRAINT pk_lpp PRIMARY KEY (ppid,lid)
> )
>
> I also tried without the primary key but that one is still running for
> more that a day.
>
> Currently the table LPP holds 471139 rows. Its linking the PP and the L
> table.
>
> There are no foreign keys referring to that table nor are there any
> other constraints on it.
> Previously I had foreign keys on lid and ppid refering to the L and PP
> table. But in a desperate try to get some speed up I deleted these. -
> But still...
>
> I am running postgresql 9.2 on a windows 2008 R2 server with 256 GB and
> the database is on something like a raid 1+0 (actually a raid1e)
> consisting of 3x4TB disks (limit of what could easily be fitted into the
> server).
>
> At the given time there were no concurrent access to any of the
> involved tables.
>
> Has anybody some idea why the insert takes so long and/or how to speed
> things up a bit? I could live with something like half an hour - better
> would be in minutes.
>
>
> Thanks for any responds,
>
> Lutz Fischer
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance

Re: problem with large inserts

From
Filip Rembiałkowski
Date:
Just an idea - how long does it take to run _only_
CREATE TEMP TABLE foo AS <your SELECT here>




On Thu, Dec 13, 2012 at 4:37 PM, Lutz Fischer
<lfischer@staffmail.ed.ac.uk> wrote:
> Hi
>
> I have currently some trouble with inserts into a table
>
> INSERT INTO LPP (PPID, LID)
> SELECT DISTINCT PPid, LID FROM
>         (SELECT * FROM PP WHERE s_id = sid) pp
>             INNER JOIN
>         has_protein hp1
>             ON pp.p1id = hp1.pid
>             INNER JOIN
>         has_protein hp2
>             ON pp.p2_id = hp2.pid
>             INNER JOIN
>         (SELECT * FROM L WHERE s_id = sid) l
>             ON (hp1.pid = l.p1id AND hp2.pid = l.p2id AND hp1.ppos +
> pp.s1 = l.s1 AND hp2.ppos + pp.s2 = l.s2)
>             OR (hp1.pid = l.p2id AND hp2.pid = l.p1id AND hp1.ppos +
> pp.s1 = l.s2 AND hp2.ppos + pp.s2 = l.s1)
>     ;
>
> If I run only
>
> SELECT DISTINCT PPid, LID FROM
>         (SELECT * FROM PP WHERE s_id = 708) pp
>             INNER JOIN
>         has_protein hp1
>             ON pp.p1id = hp1.pid
>             INNER JOIN
>         has_protein hp2
>             ON pp.p2_id = hp2.pid
>             INNER JOIN
>         (SELECT * FROM L WHERE s_id = 708) l
>             ON (hp1.pid = l.p1id AND hp2.pid = l.p2id AND hp1.ppos +
> pp.s1 = l.s1 AND hp2.ppos + pp.s2 = l.s2)
>             OR (hp1.pid = l.p2id AND hp2.pid = l.p1id AND hp1.ppos +
> pp.s1 = l.s2 AND hp2.ppos + pp.s2 = l.s1)
>     ;
>
> it returns 200620 rows in 170649 ms  ( thats just under 3  minutes). I
> stopped the actual insert after about 8h.
>
> The table that the insert happens to, is following:
> CREATE TABLE LPP
> (
>   ppid bigint NOT NULL,
>   lid bigint NOT NULL,
>   CONSTRAINT pk_lpp PRIMARY KEY (ppid,lid)
> )
>
> I also tried without the primary key but that one is still running for
> more that a day.
>
> Currently the table LPP holds 471139 rows. Its linking the PP and the L
> table.
>
> There are no foreign keys referring to that table nor are there any
> other constraints on it.
> Previously I had foreign keys on lid and ppid refering to the L and PP
> table. But in a desperate try to get some speed up I deleted these. -
> But still...
>
> I am running postgresql 9.2 on a windows 2008  R2 server with 256 GB and
> the database is on something like a raid 1+0 (actually a raid1e)
> consisting of 3x4TB disks (limit of what could easily be fitted into the
> server).
>
> At the given time  there were no concurrent access to any of the
> involved tables.
>
> Has anybody some idea why the insert takes so long and/or how to speed
> things up a bit? I could live with something like half an hour - better
> would be in minutes.
>
>
> Thanks for any responds,
>
> Lutz Fischer
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance


Re: problem with large inserts

From
Tom Lane
Date:
Lutz Fischer <lfischer@staffmail.ed.ac.uk> writes:
> I have currently some trouble with inserts into a table
> If I run only [ the select part ]
> it returns 200620 rows in 170649 ms  ( thats just under 3  minutes). I
> stopped the actual insert after about 8h.

It should not take 8h to insert 200k rows on any machine made this
century.  Frankly, I'm wondering if the insert is doing anything at all,
or is blocked on a lock somewhere.  You say there's no concurrent
activity, but how hard did you look?  Did you check that, say, the
physical disk file for the table is growing?

> I am running postgresql 9.2 on a windows 2008  R2 server with 256 GB and
> the database is on something like a raid 1+0 (actually a raid1e)
> consisting of 3x4TB disks (limit of what could easily be fitted into the
> server).

A different line of thought is that there's something seriously broken
about the raid configuration.  Have you tried basic disk-speed
benchmarks?  (Perhaps there's an equivalent of bonnie++ for windows.)

            regards, tom lane


Re: problem with large inserts

From
Lutz Fischer
Date:
Thanks a lot you saved my day

create temp table foo AS SELECT DISTINCT ...
did take a mere 77464.744 ms
And an additional
Insert into LPP select * from foo;
Just 576.909 ms

I don't really understand why it's working via a temp table but not
directly (or in any reasonable amount of time) - but at least I have a
solution I can work with.


On 13/12/12 16:09, Filip Rembiałkowski wrote:
> Just an idea - how long does it take to run _only_
> CREATE TEMP TABLE foo AS <your SELECT here>
>
>
>
>
> On Thu, Dec 13, 2012 at 4:37 PM, Lutz Fischer
> <lfischer@staffmail.ed.ac.uk> wrote:
>> Hi
>>
>> I have currently some trouble with inserts into a table
>>
>> INSERT INTO LPP (PPID, LID)
>> SELECT DISTINCT PPid, LID FROM
>>         (SELECT * FROM PP WHERE s_id = sid) pp
>>             INNER JOIN
>>         has_protein hp1
>>             ON pp.p1id = hp1.pid
>>             INNER JOIN
>>         has_protein hp2
>>             ON pp.p2_id = hp2.pid
>>             INNER JOIN
>>         (SELECT * FROM L WHERE s_id = sid) l
>>             ON (hp1.pid = l.p1id AND hp2.pid = l.p2id AND hp1.ppos +
>> pp.s1 = l.s1 AND hp2.ppos + pp.s2 = l.s2)
>>             OR (hp1.pid = l.p2id AND hp2.pid = l.p1id AND hp1.ppos +
>> pp.s1 = l.s2 AND hp2.ppos + pp.s2 = l.s1)
>>     ;
>>
>> If I run only
>>
>> SELECT DISTINCT PPid, LID FROM
>>         (SELECT * FROM PP WHERE s_id = 708) pp
>>             INNER JOIN
>>         has_protein hp1
>>             ON pp.p1id = hp1.pid
>>             INNER JOIN
>>         has_protein hp2
>>             ON pp.p2_id = hp2.pid
>>             INNER JOIN
>>         (SELECT * FROM L WHERE s_id = 708) l
>>             ON (hp1.pid = l.p1id AND hp2.pid = l.p2id AND hp1.ppos +
>> pp.s1 = l.s1 AND hp2.ppos + pp.s2 = l.s2)
>>             OR (hp1.pid = l.p2id AND hp2.pid = l.p1id AND hp1.ppos +
>> pp.s1 = l.s2 AND hp2.ppos + pp.s2 = l.s1)
>>     ;
>>
>> it returns 200620 rows in 170649 ms  ( thats just under 3  minutes). I
>> stopped the actual insert after about 8h.
>>
>> The table that the insert happens to, is following:
>> CREATE TABLE LPP
>> (
>>   ppid bigint NOT NULL,
>>   lid bigint NOT NULL,
>>   CONSTRAINT pk_lpp PRIMARY KEY (ppid,lid)
>> )
>>
>> I also tried without the primary key but that one is still running for
>> more that a day.
>>
>> Currently the table LPP holds 471139 rows. Its linking the PP and the L
>> table.
>>
>> There are no foreign keys referring to that table nor are there any
>> other constraints on it.
>> Previously I had foreign keys on lid and ppid refering to the L and PP
>> table. But in a desperate try to get some speed up I deleted these. -
>> But still...
>>
>> I am running postgresql 9.2 on a windows 2008  R2 server with 256 GB and
>> the database is on something like a raid 1+0 (actually a raid1e)
>> consisting of 3x4TB disks (limit of what could easily be fitted into the
>> server).
>>
>> At the given time  there were no concurrent access to any of the
>> involved tables.
>>
>> Has anybody some idea why the insert takes so long and/or how to speed
>> things up a bit? I could live with something like half an hour - better
>> would be in minutes.
>>
>>
>> Thanks for any responds,
>>
>> Lutz Fischer
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>>
>>
>> --
>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-performance


--
Lutz Fischer
lfischer@staffmail.ed.ac.uk
+44 131 6517057


The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: problem with large inserts

From
Filip Rembiałkowski
Date:
Hmm, so it is some kind of file / table locking issue, not general IO
system malfunction.

It would be interesting and useful to run this use case on other
postgres instance (or several instances), including non-Windows ones.

OTOH Pg on Windows housekeeping was always "fun" - I advise all my
clients to avoid it for production purposes.




On Thu, Dec 13, 2012 at 5:33 PM, Lutz Fischer
<lfischer@staffmail.ed.ac.uk> wrote:
> Thanks a lot you saved my day
>
> create temp table foo AS SELECT DISTINCT ...
> did take a mere 77464.744 ms
> And an additional
> Insert into LPP select * from foo;
> Just 576.909 ms
>
> I don't really understand why it's working via a temp table but not
> directly (or in any reasonable amount of time) - but at least I have a
> solution I can work with.
>
>
> On 13/12/12 16:09, Filip Rembiałkowski wrote:
>> Just an idea - how long does it take to run _only_
>> CREATE TEMP TABLE foo AS <your SELECT here>
>>
>>
>>
>>
>> On Thu, Dec 13, 2012 at 4:37 PM, Lutz Fischer
>> <lfischer@staffmail.ed.ac.uk> wrote:
>>> Hi
>>>
>>> I have currently some trouble with inserts into a table
>>>
>>> INSERT INTO LPP (PPID, LID)
>>> SELECT DISTINCT PPid, LID FROM
>>>         (SELECT * FROM PP WHERE s_id = sid) pp
>>>             INNER JOIN
>>>         has_protein hp1
>>>             ON pp.p1id = hp1.pid
>>>             INNER JOIN
>>>         has_protein hp2
>>>             ON pp.p2_id = hp2.pid
>>>             INNER JOIN
>>>         (SELECT * FROM L WHERE s_id = sid) l
>>>             ON (hp1.pid = l.p1id AND hp2.pid = l.p2id AND hp1.ppos +
>>> pp.s1 = l.s1 AND hp2.ppos + pp.s2 = l.s2)
>>>             OR (hp1.pid = l.p2id AND hp2.pid = l.p1id AND hp1.ppos +
>>> pp.s1 = l.s2 AND hp2.ppos + pp.s2 = l.s1)
>>>     ;
>>>
>>> If I run only
>>>
>>> SELECT DISTINCT PPid, LID FROM
>>>         (SELECT * FROM PP WHERE s_id = 708) pp
>>>             INNER JOIN
>>>         has_protein hp1
>>>             ON pp.p1id = hp1.pid
>>>             INNER JOIN
>>>         has_protein hp2
>>>             ON pp.p2_id = hp2.pid
>>>             INNER JOIN
>>>         (SELECT * FROM L WHERE s_id = 708) l
>>>             ON (hp1.pid = l.p1id AND hp2.pid = l.p2id AND hp1.ppos +
>>> pp.s1 = l.s1 AND hp2.ppos + pp.s2 = l.s2)
>>>             OR (hp1.pid = l.p2id AND hp2.pid = l.p1id AND hp1.ppos +
>>> pp.s1 = l.s2 AND hp2.ppos + pp.s2 = l.s1)
>>>     ;
>>>
>>> it returns 200620 rows in 170649 ms  ( thats just under 3  minutes). I
>>> stopped the actual insert after about 8h.
>>>
>>> The table that the insert happens to, is following:
>>> CREATE TABLE LPP
>>> (
>>>   ppid bigint NOT NULL,
>>>   lid bigint NOT NULL,
>>>   CONSTRAINT pk_lpp PRIMARY KEY (ppid,lid)
>>> )
>>>
>>> I also tried without the primary key but that one is still running for
>>> more that a day.
>>>
>>> Currently the table LPP holds 471139 rows. Its linking the PP and the L
>>> table.
>>>
>>> There are no foreign keys referring to that table nor are there any
>>> other constraints on it.
>>> Previously I had foreign keys on lid and ppid refering to the L and PP
>>> table. But in a desperate try to get some speed up I deleted these. -
>>> But still...
>>>
>>> I am running postgresql 9.2 on a windows 2008  R2 server with 256 GB and
>>> the database is on something like a raid 1+0 (actually a raid1e)
>>> consisting of 3x4TB disks (limit of what could easily be fitted into the
>>> server).
>>>
>>> At the given time  there were no concurrent access to any of the
>>> involved tables.
>>>
>>> Has anybody some idea why the insert takes so long and/or how to speed
>>> things up a bit? I could live with something like half an hour - better
>>> would be in minutes.
>>>
>>>
>>> Thanks for any responds,
>>>
>>> Lutz Fischer
>>>
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>>
>>>
>>>
>>> --
>>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-performance
>
>
> --
> Lutz Fischer
> lfischer@staffmail.ed.ac.uk
> +44 131 6517057
>
>
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>