
From: Loïc Marteau
Subject: Re: update 600000 rows
Msg-id: 4763BDCE.9010705@free.fr
In response to: Re: update 600000 rows (Steve Crawford <scrawford@pinpointresearch.com>)
List: pgsql-performance
Steve Crawford wrote:
> okparanoid@free.fr wrote:
>> Hello
>>
>> I have a Python script that updates 600,000 rows in one table of my
>> Postgres database from a CSV file, and the whole transaction takes
>> 5 hours to complete...
>>
>>
> Let's see if I guessed correctly.
>
> Your Python script is stepping through a 600,000 row file and updating
> information in a table (of unknown rows/columns) by making 600,000
> individual updates all wrapped in a big transaction. If correct, that
> means you are doing 600,000/(3,600 * 5) = 33 queries/second. If this
> is correct, I'd first investigate simply loading the csv data into a
> temporary table, creating appropriate indexes, and running a single
> query to update your other table.

I can try this. The complication is that I have to do an insert whenever the
update affects no row (because the row does not exist yet). The number of
rows added by inserts is small compared to the number of updated rows (it
was 0 when I tested my script). With a temporary table I could update all
the rows that already exist, and then insert the rows that are in the
temporary table but not in the production table, using a 'not in' statement.
Is this a correct approach?
>
>> First, when I run htop I see that the memory used never exceeds
>> 150 MB.
>> In that case I don't understand why setting the shmall and shmmax
>> kernel parameters to 16 GB of memory (the server has 32 GB) speeds up
>> the transaction a lot compared to setting shmall and shmmax to
>> (only) 2 GB?!
>>
> Are you saying that you did this and the performance improved or you
> are wondering if it would?
>
Yes, I did this and the performance improved. I don't understand why. Sorry
for my poor English...
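
If it is relevant: as far as I understand, shmall and shmmax are only upper
limits that the kernel allows, not memory that gets used by itself, so maybe
the real difference comes from something else that changed at the same time.
This is only a guess at what to look at, but I can at least check how much
shared memory the server is actually configured to use:

  SHOW shared_buffers;  -- the main part of the shared memory PostgreSQL allocates at startup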


>> The script runs in a single transaction and pauses from time to time
>> to give Postgres time to write the data to disk.
>>
> This doesn't make sense. If the transaction completes successfully
> then PostgreSQL has committed the data to disk (unless you have done
> something non-standard and not recommended like turning off fsync). If
> you are adding pauses between updates, don't do that - it will only
> slow you down. If the full transaction doesn't complete, all updates
> will be thrown away anyway and if it does complete then they were
> committed.

Sorry, the pauses are not caused by the Python script but by Postgres
itself. It does an average of about 3000 updates and then pauses for 2
minutes (htop shows the postgres process in a writing state, though I don't
really know whether it is actually doing disk I/O). What I meant was: if it
writes things to disk in the middle of the transaction, I don't understand
why?!
>> If the data were written only at the end of the transaction, would the
>> performance be better? In production I want the data written to disk
>> regularly to prevent losing data, but is there any benefit in writing
>> temporary data to disk in the middle of a transaction???
>>
>>
> See above. Actual disk IO is handled by the server. PostgreSQL is good
> at the "D" in ACID. If your transaction completes, the data has been
> written to disk. Guaranteed.
>
> Cheers,
> Steve
>
>
What I was trying to say is that in "normal" use (not when I run this
maintenance script) I want to be sure that my insert and update requests are
written to disk. They are small (1, 2 or 3 rows affected), but there are a
lot of them, issued by many users. However, just for this maintenance
script, perhaps I can do some other tweaks to reduce the I/O stress during
the transaction?!
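
The periodic writes I see in the middle of the transaction look to me like
checkpoint activity, but that is only a guess. If it is, these are the
settings I would look at first (names as I believe they exist on an 8.x
server; I am only checking the current values here, not changing anything):

  SHOW checkpoint_segments;  -- WAL segments written between automatic checkpoints
  SHOW checkpoint_timeout;   -- maximum time between automatic checkpoints
  SHOW bgwriter_delay;       -- how often the background writer wakes up to flush dirty pages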

Cheers,

Loic


