Note: I am resending this because the first copy never appeared after 40 hours.
HH
okparanoid@free.fr wrote:
> Hello
>
> I have a Python script that updates 600,000 rows in one table of my Postgres
> database from a CSV file, and it takes 5 hours to do the transaction...
>
> I'm on Debian Etch with a PostgreSQL 8.1 server on a 64-bit quad bi-Opteron box.
>
> I have deactivated all indexes except the primary key, which is not updated
> since it is also the reference column for the update.
>
> When I run this script, the server is not being used by any other user.
>
> First, when I run htop I see that the memory used is never more than 150 MB.
> In that case, I don't understand why setting the shmall and shmmax kernel
> parameters to 16 GB of memory (the server has 32 GB) speeds up the
> transaction a lot compared to setting them to (only) 2 GB?!
>
> The script runs in a single transaction and pauses from time to time to give
> Postgres time to write the data to disk.
>
> Would performance be better if the data were only written at the end of the
> transaction? In production I want data written to disk regularly to prevent
> data loss, but is there any point in writing temporary data to disk in the
> middle of a transaction???
>
> I'm a complete noob at Postgres and database configuration, so any help is
> welcome.
>
> Thanks
You will get a huge improvement in run time if you use batch updates instead
of updating one row at a time. See:
http://www.postgresql.org/docs/8.2/interactive/populate.html
and
http://www.postgresql.org/docs/8.2/interactive/sql-begin.html
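For example, something along these lines (an untested sketch, assuming
psycopg2 as the driver; "mydb", "mytable", "id", "val" and "data.csv" are
made-up names, so adapt it to your schema): stage the CSV in a temp table
with COPY, then apply the whole update as one statement in one transaction.

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")   # made-up connection string
    cur = conn.cursor()

    # Stage the CSV rows in a temporary table with COPY (fast bulk load).
    cur.execute("CREATE TEMP TABLE staging (id integer, val text)")
    f = open("data.csv")
    cur.copy_expert("COPY staging (id, val) FROM STDIN WITH CSV", f)
    f.close()

    # Apply all 600,000 changes as one UPDATE instead of one per row.
    cur.execute("""
        UPDATE mytable
           SET val = staging.val
          FROM staging
         WHERE mytable.id = staging.id
    """)

    conn.commit()   # one transaction for the whole batch
    cur.close()
    conn.close()

That also answers your question about writing in the middle of the
transaction: there is no benefit to pausing so Postgres can flush; let the
commit at the end do the work.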
You will also get a big improvement if you can turn fsync off during the
update. See:
http://www.postgresql.org/docs/8.2/interactive/runtime-config-wal.html
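That is one line in postgresql.conf, followed by a server reload (or
restart); just be sure to turn it back on afterwards, because running with
fsync off risks losing data if the machine crashes during the load:

    # postgresql.conf (revert to 'on' once the bulk update is done)
    fsync = off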
You also need to vacuum the table after doing that many updates; since pg
does a delete and insert for each update, there will be a lot of dead rows
(holes).
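From the same script that could look roughly like this (again a sketch with
psycopg2 and the made-up table name "mytable"; VACUUM cannot run inside a
transaction block, so the connection has to be in autocommit mode):

    import psycopg2
    from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT

    conn = psycopg2.connect("dbname=mydb user=me")
    conn.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)  # VACUUM must run outside a transaction
    cur = conn.cursor()
    cur.execute("VACUUM ANALYZE mytable")  # reclaim dead rows and refresh planner statistics
    cur.close()
    conn.close()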
Cheers
HH