Thread: update 600000 rows

update 600000 rows

From
okparanoid@free.fr
Date:
Hello

I have a Python script that updates 600,000 rows in one table from a CSV file in my
PostgreSQL database, and the transaction takes 5 hours to complete...

I'm on Debian Etch with PostgreSQL 8.1 on a 64-bit quad dual-core Opteron server.

I have deactivated all indexes except the primary key, which is not updated since it is
also the reference column for the update.

When I run this script, the server is not being used by any other user.

First, when I run htop I see that memory usage never goes above 150 MB.
In that case I don't understand why setting the shmall and shmmax kernel
parameters to 16 GB (the server has 32 GB) speeds up the transaction a lot
compared to setting shmall and shmmax to (only) 2 GB?!

The script runs in a single transaction and pauses from time to time to give
PostgreSQL time to write data to disk.

Would performance be better if the data were only written at the end of the
transaction? In production I want data written to disk regularly to prevent
data loss, but is there any point in writing temporary data to disk in the
middle of a transaction???

I'm a complete newbie to PostgreSQL and database configuration, and any help is
welcome.

thanks



Re: update 600000 rows

From
Steve Crawford
Date:
okparanoid@free.fr wrote:
> Hello
>
> I have a Python script that updates 600,000 rows in one table from a CSV file in my
> PostgreSQL database, and the transaction takes 5 hours to complete...
>
>
Let's see if I guessed correctly.

Your Python script is stepping through a 600,000 row file and updating
information in a table (of unknown rows/columns) by making 600,000
individual updates all wrapped in a big transaction. If correct, that
means you are doing 600,000/(3,600 * 5) = 33 queries/second. If this is
correct, I'd first investigate simply loading the csv data into a
temporary table, creating appropriate indexes, and running a single
query to update your other table.
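
A minimal sketch of that approach, with hypothetical table and column names
(a production table "prod" keyed on "id", a CSV with columns id and val --
adjust to your actual schema and file path):

    -- Load the CSV into a temporary staging table; the primary key gives us an index.
    CREATE TEMP TABLE staging (id integer PRIMARY KEY, val text);
    COPY staging FROM '/path/to/data.csv' WITH CSV;   -- or \copy from psql
    ANALYZE staging;

    -- One set-based UPDATE instead of 600,000 single-row statements:
    UPDATE prod
       SET val = staging.val
      FROM staging
     WHERE prod.id = staging.id;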

> First, when I run htop I see that memory usage never goes above 150 MB.
> In that case I don't understand why setting the shmall and shmmax kernel
> parameters to 16 GB (the server has 32 GB) speeds up the transaction a lot
> compared to setting shmall and shmmax to (only) 2 GB?!
>
Are you saying that you did this and the performance improved or you are
wondering if it would?

> The script runs in a single transaction and pauses from time to time to give
> PostgreSQL time to write data to disk.
>
This doesn't make sense. If the transaction completes successfully then
PostgreSQL has committed the data to disk (unless you have done
something non-standard and not recommended like turning off fsync). If
you are adding pauses between updates, don't do that - it will only slow
you down. If the full transaction doesn't complete, all updates will be
thrown away anyway and if it does complete then they were committed.
> Would performance be better if the data were only written at the end of the
> transaction? In production I want data written to disk regularly to prevent
> data loss, but is there any point in writing temporary data to disk in the
> middle of a transaction???
>
>
See above. Actual disk IO is handled by the server. PostgreSQL is good
at the "D" in ACID. If your transaction completes, the data has been
written to disk. Guaranteed.

Cheers,
Steve

Re: update 600000 rows

From
Loïc Marteau
Date:
Steve Crawford wrote:
> okparanoid@free.fr wrote:
>> Hello
>>
>> I have a Python script that updates 600,000 rows in one table from a CSV
>> file in my PostgreSQL database, and the transaction takes 5 hours to
>> complete...
>>
>>
> Let's see if I guessed correctly.
>
> Your Python script is stepping through a 600,000 row file and updating
> information in a table (of unknown rows/columns) by making 600,000
> individual updates all wrapped in a big transaction. If correct, that
> means you are doing 600,000/(3,600 * 5) = 33 queries/second. If this
> is correct, I'd first investigate simply loading the csv data into a
> temporary table, creating appropriate indexes, and running a single
> query to update your other table.

I can try this. The problem is that I have to do an INSERT when the UPDATE
does not affect any row (because the row does not exist yet). The number of
rows affected by the INSERT is small compared to the number of updated rows
(it was 0 when I tested my script). With a temporary table I could update all
the possible rows and then insert the rows that are in the temporary table but
not yet in the production table, using a 'NOT IN' statement. Is this a correct
way to do it?
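
A sketch of that two-step approach, reusing the hypothetical names from the
earlier sketch (a staging table loaded from the CSV and a production table
prod keyed on id). NOT EXISTS is usually a safer and faster choice than NOT IN
in PostgreSQL, especially if the key column can contain NULLs:

    -- Step 1: update the rows that already exist.
    UPDATE prod
       SET val = staging.val
      FROM staging
     WHERE prod.id = staging.id;

    -- Step 2: insert the rows that are in the staging table but not yet in prod.
    INSERT INTO prod (id, val)
    SELECT s.id, s.val
      FROM staging s
     WHERE NOT EXISTS (SELECT 1 FROM prod p WHERE p.id = s.id);
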
>
>> First, when I run htop I see that memory usage never goes above 150 MB.
>> In that case I don't understand why setting the shmall and shmmax kernel
>> parameters to 16 GB (the server has 32 GB) speeds up the transaction a lot
>> compared to setting shmall and shmmax to (only) 2 GB?!
>>
> Are you saying that you did this and the performance improved or you
> are wondering if it would?
>
Yes, I did this and the performance improved. I don't understand why. Sorry
for my poor English...


>> The script runs in a single transaction and pauses from time to time to give
>> PostgreSQL time to write data to disk.
>>
> This doesn't make sense. If the transaction completes successfully
> then PostgreSQL has committed the data to disk (unless you have done
> something non-standard and not recommended like turning off fsync). If
> you are adding pauses between updates, don't do that - it will only
> slow you down. If the full transaction doesn't complete, all updates
> will be thrown away anyway and if it does complete then they were
> committed.

Sorry, the pauses are not caused by the Python script but by PostgreSQL
itself. It does roughly 3,000 updates on average and then pauses for 2 minutes
(htop tells me that the postgres process is writing, though I don't really
know whether it is actually doing disk I/O). What I mean is: if it writes
things to disk during the transaction, I don't understand why.
>> Would performance be better if the data were only written at the end of the
>> transaction? In production I want data written to disk regularly to prevent
>> data loss, but is there any point in writing temporary data to disk in the
>> middle of a transaction???
>>
>>
> See above. Actual disk IO is handled by the server. PostgreSQL is good
> at the "D" in ACID. If your transaction completes, the data has been
> written to disk. Guaranteed.
>
> Cheers,
> Steve
>
>
What I'm trying to say is that in "normal" use (not when I run this
maintenance script) I want to be sure that INSERT and UPDATE requests are
written to disk. They are small (1, 2 or 3 rows affected) but there are a lot
of them, issued by many users. However, just for this maintenance script, can
I perhaps do some other tweaking to reduce the I/O stress during the transaction?

Cheers,

Loic



Re: update 600000 rows

From
Greg Smith
Date:
On Sat, 15 Dec 2007, okparanoid@free.fr wrote:

> First, when I run htop I see that memory usage never goes above 150 MB.
> In that case I don't understand why setting the shmall and shmmax kernel
> parameters to 16 GB (the server has 32 GB) speeds up the transaction a lot
> compared to setting shmall and shmmax to (only) 2 GB?!

The kernel parameters provide an upper limit for how much memory
PostgreSQL can allocate, but by themselves they don't actually request
more memory.  There is a configuration parameter called shared_buffers
that is the main thing to adjust.  Since you say you're new to this, see
http://www.westnet.com/~gsmith/content/postgresql/pg-5minute.htm for the
first set of things you should be adjusting.

If you're doing lots of updates, you'll need to increase
checkpoint_segments as well.  Once you get the memory allocated and
checkpoint parameters in the right range, at that point you'll be prepared
to look into transaction grouping and application issues in that area.
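
For example, in postgresql.conf (illustrative values only -- tune them for
your own workload and 32 GB of RAM; 8.1 takes shared_buffers as a number of
8 kB buffers rather than a memory unit, and shared_buffers needs a server
restart to take effect):

    shared_buffers = 131072        # ~1 GB of shared buffer cache (131072 * 8 kB)
    checkpoint_segments = 32       # allow more WAL between checkpoints during bulk updates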

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: update 600000 rows

From
andrew@pillette.com
Date:
Loïc Marteau <okparanoid@free.fr> wrote ..
> Steve Crawford wrote:
> > If this
> > is correct, I'd first investigate simply loading the csv data into a
> > temporary table, creating appropriate indexes, and running a single
> > query to update your other table.

My experience is that this is MUCH faster. My predecessor in my current position was doing an update from a csv file
line by line with Perl. That is one reason he is my predecessor. Performance did not justify continuing his contract.

> I can try this. The problem is that I have to do an INSERT when the UPDATE
> does not affect any row (because the row does not exist yet). The number of
> rows affected by the INSERT is small compared to the number of updated rows
> (it was 0 when I tested my script). With a temporary table I could update all
> the possible rows and then insert the rows that are in the temporary table but
> not yet in the production table, using a 'NOT IN' statement. Is this a correct
> way to do it?

That's what I did at first, but later I found better performance with a TRIGGER on the permanent table that deletes the
target of an UPDATE, if any, before the UPDATE. That's what PG does anyway, and now I can do the entire UPDATE in one
command.
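
A sketch of that trigger approach, with the same hypothetical prod/staging
names as above (and note that, as clarified later in the thread, the trigger
fires before the INSERT, so the whole load becomes a single INSERT from the
staging table):

    -- Delete any existing row with the same key, so the INSERT effectively replaces it.
    -- (on 8.1 you may first need: CREATE LANGUAGE plpgsql;)
    CREATE OR REPLACE FUNCTION replace_existing() RETURNS trigger AS $$
    BEGIN
        DELETE FROM prod WHERE id = NEW.id;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER replace_before_insert
        BEFORE INSERT ON prod
        FOR EACH ROW EXECUTE PROCEDURE replace_existing();

    -- The whole load is now one statement:
    INSERT INTO prod SELECT * FROM staging;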

Re: update 600000 rows

From
"H. Hall"
Date:
okparanoid@free.fr wrote:
> Hello
>
> I have a Python script that updates 600,000 rows in one table from a CSV file in my
> PostgreSQL database, and the transaction takes 5 hours to complete...
>
> I'm on Debian Etch with PostgreSQL 8.1 on a 64-bit quad dual-core Opteron server.
>
> I have deactivated all indexes except the primary key, which is not updated since it is
> also the reference column for the update.
>
> When I run this script, the server is not being used by any other user.
>
> First, when I run htop I see that memory usage never goes above 150 MB.
> In that case I don't understand why setting the shmall and shmmax kernel
> parameters to 16 GB (the server has 32 GB) speeds up the transaction a lot
> compared to setting shmall and shmmax to (only) 2 GB?!
>
> The script runs in a single transaction and pauses from time to time to give
> PostgreSQL time to write data to disk.
>
> Would performance be better if the data were only written at the end of the
> transaction? In production I want data written to disk regularly to prevent
> data loss, but is there any point in writing temporary data to disk in the
> middle of a transaction???
>
> I'm a complete newbie to PostgreSQL and database configuration, and any help is
> welcome.
>
> thanks

You will get a huge improvement in time if you use batch updates instead
of updating a row at a time. See:

    http://www.postgresql.org/docs/8.2/interactive/populate.html
        and
    http://www.postgresql.org/docs/8.2/interactive/sql-begin.html

You will also get a big improvement if you can turn fsync off during the
update. See:
    http://www.postgresql.org/docs/8.2/interactive/runtime-config-wal.html

You also need to vacuum the table after doing that many updates: since pg
does a delete and insert on each update, there will be a lot of holes (dead
row versions).
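
For example (hypothetical table name), once the bulk load has committed and
fsync has been turned back on:

    VACUUM ANALYZE prod;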

Cheers
HH




Re: update 600000 rows

From
"Merlin Moncure"
Date:
On Dec 16, 2007 12:21 AM,  <andrew@pillette.com> wrote:
> Loïc Marteau <okparanoid@free.fr> wrote ..
> > Steve Crawford wrote:
> > > If this
> > > is correct, I'd first investigate simply loading the csv data into a
> > > temporary table, creating appropriate indexes, and running a single
> > > query to update your other table.
>
> My experience is that this is MUCH faster. My predecessor in my current position was doing an update from a csv file
> line by line with Perl. That is one reason he is my predecessor. Performance did not justify continuing his contract.
>
> > I can try this. The problem is that I have to do an INSERT when the UPDATE
> > does not affect any row (because the row does not exist yet). The number of
> > rows affected by the INSERT is small compared to the number of updated rows
> > (it was 0 when I tested my script). With a temporary table I could update all
> > the possible rows and then insert the rows that are in the temporary table but
> > not yet in the production table, using a 'NOT IN' statement. Is this a correct
> > way to do it?
>
> That's what I did at first, but later I found better performance with a TRIGGER on the permanent table that deletes
> the target of an UPDATE, if any, before the UPDATE. That's what PG does anyway, and now I can do the entire UPDATE in
> one command.

That's very clever, and probably the fastest/best way to do it.
You can even temporarily add the trigger within a transaction... I am going to
try this out in a couple of places (I currently do these types of
things in two statements) and see how it turns out.
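
Something along these lines, reusing the hypothetical trigger function
sketched earlier in the thread (CREATE and DROP TRIGGER are transactional in
PostgreSQL, so the trigger never outlives this transaction):

    BEGIN;
    CREATE TRIGGER replace_before_insert
        BEFORE INSERT ON prod
        FOR EACH ROW EXECUTE PROCEDURE replace_existing();
    INSERT INTO prod SELECT * FROM staging;
    DROP TRIGGER replace_before_insert ON prod;
    COMMIT;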

merlin

Re: update 600000 rows

From
andrew@pillette.com
Date:
I wrote
> That's what I did at first, but later I found better performance with
> a TRIGGER on the permanent table that deletes the target of an UPDATE,
> if any, before the UPDATE. [INSERT!] That's what PG does anyway, and now I can do
> the entire UPDATE [INSERT] in one command.

It's probably obvious, but what I should have said is that now I do the INSERT in one command; I have no need of an
UPDATE. So I do no UPDATEs, only INSERTs.
