Re: update 600000 rows - Mailing list pgsql-performance
From | Loïc Marteau |
---|---|
Subject | Re: update 600000 rows |
Date | |
Msg-id | 4763BDCE.9010705@free.fr |
In response to | Re: update 600000 rows (Steve Crawford <scrawford@pinpointresearch.com>) |
List | pgsql-performance |
Steve Crawford wrote:
> okparanoid@free.fr wrote:
>> Hello
>>
>> i have a python script to update 600000 rows to one table from a csv
>> file in my postgres database and it takes me 5 hours to do the
>> transaction...
>>
> Let's see if I guessed correctly.
>
> Your Python script is stepping through a 600,000 row file and updating
> information in a table (of unknown rows/columns) by making 600,000
> individual updates all wrapped in a big transaction. If correct, that
> means you are doing 600,000/(3,600 * 5) = 33 queries/second. If this
> is correct, I'd first investigate simply loading the csv data into a
> temporary table, creating appropriate indexes, and running a single
> query to update your other table.

I can try this. The problem is that I also have to do an insert when the
update does not affect any row (the row does not exist yet). The number of
rows affected by inserts is small compared to the number of updated rows
(it was 0 when I tested my script). With a temporary table I could update
all the rows that already exist, then insert the rows that are in the
temporary table but not in the production table, using a 'not in'
statement. Is this a correct way to do it? (A rough sketch of what I have
in mind is at the end of this mail.)

>> First when i run htop i see that the memory used is never more than
>> 150 MB. I don't understand in this case why setting shmall and shmmax
>> kernel's parameters to 16 GB of memory (the server has 32 GB) increase
>> the rapidity of the transaction a lot compared to a shmall and shmax
>> in (only) 2 GB ?!
>>
> Are you saying that you did this and the performance improved or you
> are wondering if it would?

Yes, I did this and the performance improved. I don't understand why.
Sorry for my poor English...

>> The script is run with only one transaction and pause by moment to
>> let the time to postgres to write data to disk.
>>
> This doesn't make sense. If the transaction completes successfully
> then PostgreSQL has committed the data to disk (unless you have done
> something non-standard and not recommended like turning off fsync). If
> you are adding pauses between updates, don't do that - it will only
> slow you down. If the full transaction doesn't complete, all updates
> will be thrown away anyway and if it does complete then they were
> committed.

Sorry, the pauses are not caused by the Python script but by postgres
itself: it does roughly 3000 updates on average, then pauses for about
2 minutes (htop shows postgres in a writing state; I don't know for sure
whether it is really doing disk I/O). What I meant is: if it writes
things to disk in the middle of the transaction, I don't understand why.
(See also the settings listed at the end of this mail.)

>> If the data were writed at the end of the transaction will be the
>> perfomance better ? i wan't that in production data regulary writed
>> to disk to prevent loosinf of data but it there any interest to write
>> temporary data in disk in a middle of a transaction ???
>>
> See above. Actual disk IO is handled by the server. PostgreSQL is good
> at the "D" in ACID. If your transaction completes, the data has been
> written to disk. Guaranteed.
>
> Cheers,
> Steve

What I was trying to say is that in "normal" use (not when I run this
maintenance script) I want to be sure that my insert and update requests
are written to disk. They are small (1, 2 or 3 rows affected each), but
there are a lot of them, issued by many users. For this maintenance
script alone, however, is there perhaps some other tweak I can apply to
reduce the I/O stress during the transaction?

Cheers,

Loic
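P.S. Here is a rough sketch of the temp-table approach I have in mind. The
table and column names (mytable, id, value) and the csv path are invented
for the example; the real table has different columns. I use NOT EXISTS
rather than NOT IN, which should do the same job without surprises if the
key column can be NULL:

    BEGIN;

    -- a throw-away table with the same layout as the csv file
    CREATE TEMP TABLE staging (id integer, value text);

    -- bulk-load the whole file in one statement instead of 600000 UPDATEs
    -- (from psql on the client side, \copy can be used instead)
    COPY staging FROM '/path/to/data.csv' WITH CSV;

    CREATE INDEX staging_id_idx ON staging (id);
    ANALYZE staging;

    -- one UPDATE for every row that already exists in production
    UPDATE mytable
       SET value = s.value
      FROM staging s
     WHERE mytable.id = s.id;

    -- then insert the rows that exist only in the staging table
    INSERT INTO mytable (id, value)
    SELECT s.id, s.value
      FROM staging s
     WHERE NOT EXISTS (SELECT 1 FROM mytable m WHERE m.id = s.id);

    COMMIT;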
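P.P.S. For completeness, the server settings that control how much
PostgreSQL caches in shared memory and how often it flushes dirty data
during a long transaction can be checked like this (whether they explain
the pauses described above is only an assumption; they are changed in
postgresql.conf, not from SQL):

    SHOW shared_buffers;       -- how much data PostgreSQL itself caches in shared memory
    SHOW checkpoint_segments;  -- how much WAL can accumulate before a checkpoint forces dirty pages to disk
    SHOW checkpoint_timeout;   -- maximum time between two checkpoints
    SHOW bgwriter_delay;       -- how often the background writer trickles dirty buffers to disk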