Thread: COPY from question

COPY from question

From: Slavisa Garic
Hi,

I have a question about the COPY statement. I am using PostgreSQL 7.3.4 with
Python 2.3 on a Red Hat 8 machine. The problem I have is the following.

Using the pg module in Python, I am trying to run the COPY command to
populate a large table. I am using this to replace INSERTs, which take a few
hours to add 70000 entries, where COPY takes a minute and a half. Now these
stats come from a NetBSD machine I also use, which doesn't have this problem
but has the same Python and PostgreSQL installed.

My understanding is that COPY works FROM 'filename' or STDIN, where the data
ends with '\.' on a line of its own. I tried using COPY FROM 'filename' and,
as I said, NetBSD is not complaining, but I get the following error on the
Linux machine even though permissions on the data file are 777:

  _pg.error: ERROR:  COPY command, running in backend with effective uid 26,
  could not open file '/home/slavisa/.nimrod/experiments/demo/ejdata' for
  reading.  Errno = Permission denied (13).
 

I can't figure out why this would be occurring, so I wanted to switch to the
FROM STDIN option, but I got stuck there due to a lack of knowledge, I have
to admit.

So this is what I would like to ask anyone who knows anything about this: if
you know what the problem is with the FROM 'filename' option, or you know
how to get COPY FROM STDIN working from within a Python (or any other)
program, help would be greatly appreciated.

Regards,
Slavisa



Re: COPY from question

From: Kris Jurka

On Tue, 3 Feb 2004, Slavisa Garic wrote:

> My understanding is that COPY works FROM 'filename' or STDIN, where the
> data ends with '\.' on a line of its own. I tried using COPY FROM
> 'filename' and, as I said, NetBSD is not complaining, but I get the
> following error on the Linux machine even though permissions on the data
> file are 777:
>
>   _pg.error: ERROR:  COPY command, running in backend with effective uid
>   26, could not open file '/home/slavisa/.nimrod/experiments/demo/ejdata'
>   for reading.  Errno = Permission denied (13).
>
>

This is probably a permissions problem at a higher level, check the
permissions on the directories in the path.
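
A quick way to see which component is blocking is to print the permission
bits on every step of the path; each directory needs the execute (search)
bit for the backend's user (uid 26, i.e. postgres) to reach the file. A
minimal sketch in Python, since that's what you're using (the path is taken
from the error message above):

    import os

    # Print the permission bits for each component of the path.
    # Every directory needs the 'x' bit for the postgres user;
    # the final component is the file itself.
    path = '/home/slavisa/.nimrod/experiments/demo/ejdata'
    parts = path.split(os.sep)
    for i in range(2, len(parts) + 1):
        p = os.sep.join(parts[:i])
        print oct(os.stat(p).st_mode & 0777), p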

Kris Jurka



Re: COPY from question

From: Tom Lane
Slavisa Garic <Slavisa.Garic@infotech.monash.edu.au> writes:
> ... I get the following error on the Linux
> machine even though permissions on the data file are 777:

>   _pg.error: ERROR:  COPY command, running in backend with effective uid
>   26, could not open file '/home/slavisa/.nimrod/experiments/demo/ejdata'
>   for reading.  Errno = Permission denied (13).

Most likely the postgres user doesn't have search (execute) permission on
one of the directories in that path.
        regards, tom lane

PS: this didn't really belong on pghackers.


Re: COPY from question

From: Kevin Brown
Slavisa Garic wrote:
> Using the pg module in Python, I am trying to run the COPY command to
> populate a large table. I am using this to replace INSERTs, which take a
> few hours to add 70000 entries, where COPY takes a minute and a half.

That difference in speed seems quite large.  Too large.  Are you batching
your INSERTs into transactions (you should be in order to get good
performance)?  Do you have a ton of indexes on the table?  Does it have
triggers on it or some other thing (if so then COPY may well wind up doing
the wrong thing since the triggers won't fire for the rows it inserts)?

I don't know what kind of schema you're using, but it takes perhaps a
couple of hours to insert 2.5 million rows on my system.  But the rows
in my schema may be much smaller than yours.
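
For reference, batching with the pg module looks something like this (a
rough sketch only; the table name, columns, 'rows' variable and database
name are made up for illustration, and real code would need to escape the
values properly):

    import pg

    # Batch INSERTs into explicit transactions instead of letting each
    # INSERT commit on its own.  Committing every 1000 rows is a
    # compromise between speed and memory/lock time.
    db = pg.connect('mydb')

    db.query('BEGIN')
    n = 0
    for name, status in rows:
        db.query("INSERT INTO jobs (name, status) VALUES ('%s', '%s')"
                 % (name, status))
        n = n + 1
        if n % 1000 == 0:
            db.query('COMMIT')
            db.query('BEGIN')
    db.query('COMMIT')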


-- 
Kevin Brown                          kevin@sysexperts.com


Re: COPY from question

From: Tom Lane
Kevin Brown <kevin@sysexperts.com> writes:
> Slavisa Garic wrote:
>> Using the pg module in Python, I am trying to run the COPY command to
>> populate a large table. I am using this to replace INSERTs, which take a
>> few hours to add 70000 entries, where COPY takes a minute and a half.

> That difference in speed seems quite large.  Too large.  Are you batching
> your INSERTs into transactions (you should be in order to get good
> performance)?  Do you have a ton of indexes on the table?  Does it have
> triggers on it or some other thing (if so then COPY may well wind up doing
> the wrong thing since the triggers won't fire for the rows it
> inserts)?

COPY *does* fire triggers, and has done so for quite a few releases.

My bet is that the issue is failing to batch individual INSERTs into
transactions.  On a properly-set-up machine you can't get more than one
transaction commit per client per disk revolution, so the penalty for
trivial transactions like single inserts is pretty steep.
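(To put rough numbers on that: assuming a 7200 RPM disk, that is at most 120
commits per second, so 70000 single-row transactions spend at least about
ten minutes just waiting on commits.)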
        regards, tom lane


Re: COPY from question

From: Slavisa Garic
Hi Kevin,

On Tue, 3 Feb 2004, Kevin Brown wrote:

> Slavisa Garic wrote:
> > Using the pg module in Python, I am trying to run the COPY command to
> > populate a large table. I am using this to replace INSERTs, which take a
> > few hours to add 70000 entries, where COPY takes a minute and a half.
> 
> That difference in speed seems quite large.  Too large.  Are you batching
> your INSERTs into transactions (you should be in order to get good
> performance)?  Do you have a ton of indexes on the table?  Does it have
> triggers on it or some other thing (if so then COPY may well wind up doing
> the wrong thing since the triggers won't fire for the rows it inserts)?
> 
> I don't know what kind of schema you're using, but it takes perhaps a
> couple of hours to insert 2.5 million rows on my system.  But the rows
> in my schema may be much smaller than yours.

You are right about the indexes. There are quite a few of them (5-6, without
looking at the schema). The problem is that I do need those indexes, as I
have a lot of SELECTs on that table and the inserts only happen once.

You are also right about the rows (I think), as I have about 15-20 columns.
This could be split into a few other tables, and it used to be, but I merged
them because of the requirement for faster SELECTs. With the current schema,
most of my modules that access the database no longer need to do the
expensive JOINs they used to. Because faster SELECTs are more important to
me than faster INSERTs, I had to do this. This wasn't a problem for me until
I started creating experiments with more than 20 thousand jobs, which
translates to 20 thousand rows in this big table.

I do batch INSERTs into one big transaction (1000 rows at a time). While I
did get some improvement compared to a single transaction per INSERT, it was
still not fast enough (well, not for me :) ). Could you please elaborate on
the triggers? I have no idea what kinds of triggers there are in PostgreSQL
or relational databases.

As for my original problem, I did solve it by piping the data into COPY ...
FROM STDIN. Now I have about 75000 rows inserted in 40 seconds, which is
extremely good for me. A sketch of the approach is below.
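
Roughly like this (a minimal sketch with the classic PyGreSQL pg module,
whose putline/endcopy methods wrap libpq's PQputline/PQendcopy; the table
name, the 'rows' variable and the database name are made up for
illustration):

    import pg

    db = pg.connect('mydb')

    # Start the COPY, then stream the data down the same connection.
    db.query('COPY jobs FROM STDIN')
    for row in rows:
        # COPY's default text format: tab-separated columns, one row per
        # line.  Real data needs backslash/tab/newline escaping.
        db.putline('\t'.join(map(str, row)) + '\n')
    db.putline('\\.\n')    # end-of-data marker: backslash-period
    db.endcopy()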

Thank you for your help,
Regards,
Slavisa
