Thread: Capacity questions
Hello all,

I have written a script which does what I want it to do on a table with 999 records.

I was wondering if there is anything I need to consider if I run this script on a table with 1.1 million records?

The purpose of the script is to flag records representing files if they are redundant (duplicate md5sum). Later this table will be used to move or remove the files represented by the flagged records.

The script performs the following actions:

1. resets the del column to false for all records
2. creates a temporary table consisting of aggregate records of identical md5sums with a count greater than 1
3. iterates down the cursor and, with this information,
4. updates the del column to True on all records in a batch of identical md5sums except for one

Details are below. Thanks in advance for your advice.

Regards,

Bill

The server hosting PostgreSQL: 32-bit 2.8 GHz CPU with 2 GB RAM running openSUSE 12.1.
The workstation executing this script: 64-bit 3.1 GHz CPU with 12 GB RAM running openSUSE 12.1.

Table description:

          Table "public.files_test"
 Column |      Type       | Modifiers
--------+-----------------+-----------
 md5sum | character(33)   |
 path   | character(475)  |
 file   | character(200)  |
 del    | boolean         |
 recno  | integer         | not null
Indexes:
    "files_test_ord" btree (recno)

Script:

<CODE>
#!/usr/bin/python
#20120609
import sys
import os
import psycopg2
import psycopg2.extras
import psycopg2.extensions

sys.path.append('/mnt/nfs/bb/library_1/stuff_mine/wch/z_projects/computer/z_development/languages/python/z_wch_functions')
import z_all
#z_all.zz_clear()

debug = 0
if debug == 1:
    z_all.zz_set_trace_on()

def main():
    conn = None
    #get the connection information
    cxn = z_all.zz_pickle_load('_cxn_bb_wch')
    table = "files_test"
    try:
        #use connection information from the list retrieved from the pickle file
        conn = psycopg2.connect(host=cxn['host'], dbname=cxn['db'],
                                user=cxn['user'], password=cxn['password'])

        print "Resetting all del fields to false."
        working_curs = conn.cursor()
        cmd = "working_curs.execute(\"\"\"UPDATE %s SET del = False;\"\"\")" % table
        exec cmd
        conn.commit()
        working_curs.close()

        print "Creating walking cursor."
        walking_curs = conn.cursor("walking_curs")
        cmd = "walking_curs.execute(\"\"\"SELECT md5sum, count(md5sum) FROM %s GROUP BY md5sum HAVING count(md5sum) > 1;\"\"\")" % table
        exec cmd

        row_cnt = 0
        for row in walking_curs:
            mem_md5 = row[0]
            mem_cnt = row[1]
            row_cnt += 1
            print mem_md5, mem_cnt, len(mem_md5)

            print "creating working_cursora"
            working_cursa = conn.cursor()

            print "running command on working_cursora with substitution"
            cmd = "working_cursa.execute(\"\"\"WITH batch AS (SELECT * FROM %s WHERE md5sum = '%s' ORDER BY path DESC OFFSET 1) UPDATE %s SET del = True FROM batch WHERE batch.md5sum || batch.path = files_test.md5sum || files_test.path;\"\"\")" % (table, mem_md5, table)
            exec cmd

            print "closing working_cursora"
            working_cursa.close()

        walking_curs.close()
        conn.commit()

    except psycopg2.DatabaseError, e:
        print "psycopg2 reports error: {}".format(e)
        sys.exit(1)

    finally:
        if conn:
            conn.close()

    return

if __name__ == '__main__':
    main()
</CODE>
On Sun, Jun 17, 2012 at 4:32 PM, Bill House <wch-tech@house-grp.net> wrote:

> I have written a script which does what I want it to do on a table
> with 999 records.
>
> I was wondering if there is anything I need to consider if I run this
> script on a table with 1.1 million records?

It looks fine to me. The only note is that (I may be wrong) "UPDATE %s SET del = False" also updates the records whose del is already false, taking more time and creating unnecessary bloat. Adding "WHERE del" may help keep the bloat to a minimum.

-- Daniele
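Daniele's refinement can be sketched as a small helper. This is a minimal illustration, not part of Bill's script; `conn` is assumed to be an open psycopg2 connection, and the function name is mine:

```python
# Reset the del flag only on rows where it is currently true; rows that
# are already false are left untouched, so they produce no dead tuples.
RESET_SQL = "UPDATE files_test SET del = FALSE WHERE del;"

def reset_del_flags(conn):
    # conn is assumed to be an open psycopg2 connection.
    cur = conn.cursor()
    cur.execute(RESET_SQL)
    conn.commit()
    cur.close()
```

On a 1.1-million-row table where most rows are already false, the WHERE clause means the statement rewrites only the rows that actually change.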
On 06/17/2012 01:02 PM, Daniele Varrazzo wrote:
> On Sun, Jun 17, 2012 at 4:32 PM, Bill House <wch-tech@house-grp.net> wrote:
>
>> I have written a script which does what I want it to do on a table
>> with 999 records.
>>
>> I was wondering if there is anything I need to consider if I run this
>> script on a table with 1.1 million records?
>
> It looks fine to me. The only note is that (I may be wrong) "UPDATE %s
> SET del = False" also updates the records whose del is already false,
> taking more time and creating unnecessary bloat. Adding "WHERE del"
> may help keep the bloat to a minimum.
>
> -- Daniele

You are referring to the first SQL command, and you are correct. On a table this size, time is a consideration, even if it is a one-time task.

Thanks very much.

Regards,

Bill
On 06/17/2012 11:11 AM, Bill House wrote:
> On 06/17/2012 01:02 PM, Daniele Varrazzo wrote:
>> On Sun, Jun 17, 2012 at 4:32 PM, Bill House <wch-tech@house-grp.net> wrote:
>>
>>> I have written a script which does what I want it to do on a table
>>> with 999 records.
>>>
>>> I was wondering if there is anything I need to consider if I run this
>>> script on a table with 1.1 million records?
>>
>> It looks fine to me. The only note is that (I may be wrong) "UPDATE %s
>> SET del = False" also updates the records whose del is already false,
>> taking more time and creating unnecessary bloat. Adding "WHERE del"
>> may help keep the bloat to a minimum.
>>
>> -- Daniele
>
> You are referring to the first SQL command, and you are correct. On a
> table this size, time is a consideration, even if it is a one-time task.

An additional comment: you really ought to consider using the parameter passing mechanism built into psycopg2:

http://initd.org/psycopg/docs/usage.html#passing-parameters-to-sql-queries

See here for why that is important:

http://blog.endpoint.com/2012/06/detecting-postgres-sql-injection.html

> Thanks very much.
>
> Regards,
>
> Bill

--
Adrian Klaver
adrian.klaver@gmail.com
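As an illustration of Adrian's point, here is one statement from Bill's script rewritten with psycopg2's parameter passing. This is a hypothetical rewrite, not the thread's code: the table name still has to be interpolated with `%` (a table name cannot be a query parameter), but the md5sum value is handed to `execute()` as a parameter tuple, so psycopg2 quotes and escapes it itself:

```python
TABLE = "files_test"

# The %s slots are filled with the table name up front; %%s survives the
# string formatting as the single %s placeholder that psycopg2 fills in,
# safely quoted, at execute() time.
FLAG_SQL = (
    "WITH batch AS ("
    " SELECT * FROM %s WHERE md5sum = %%s"
    " ORDER BY path DESC OFFSET 1)"
    " UPDATE %s SET del = TRUE FROM batch"
    " WHERE batch.md5sum || batch.path = %s.md5sum || %s.path;"
) % (TABLE, TABLE, TABLE, TABLE)

def flag_batch(conn, md5sum):
    # The md5sum value travels as a parameter tuple, never via % formatting,
    # so no quoting or injection issues arise from its content.
    cur = conn.cursor()
    cur.execute(FLAG_SQL, (md5sum,))
    cur.close()
```

This also removes the need for the `exec cmd` indirection in the original script: with values out of the SQL string, the query can be executed directly.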
On Sun, Jun 17, 2012 at 7:17 PM, Adrian Klaver <adrian.klaver@gmail.com> wrote:

> An additional comment:
>
> You really ought to consider using the parameter passing mechanism built into psycopg2:

Oh, absolutely true. I had only seen the first query, where the % was used to pass a table name, which cannot (yet) be passed as a parameter, and didn't notice it was later used for a value too.

-- Daniele
On 17/06/12 17:32, Bill House wrote:
> I have written a script which does what I want it to do on a table
> with 999 records.
>
> I was wondering if there is anything I need to consider if I run this
> script on a table with 1.1 million records?
>
> The purpose of the script is to flag records representing files if they
> are redundant (duplicate md5sum).
>
> Later this table will be used to move or remove the files represented by
> the flagged records.
>
> The script performs the following actions:
>
> 1. resets the del column to false for all records
> 2. creates a temporary table consisting of aggregate records of
>    identical md5sums with a count greater than 1
> 3. iterates down the cursor and, with this information,
> 4. updates the del column to True on all records in a batch of
>    identical md5sums except for one

If your recno column is a primary key, you can do this with a single SQL UPDATE, without the initial update, temporary table or psycopg (with the right indices this is blazing fast even on a gazillion rows):

UPDATE files_test SET del = T.n > 1
  FROM (SELECT recno,
               row_number() OVER (PARTITION BY md5sum) AS n
        FROM files_test) T
 WHERE files_test.recno = T.recno;

Hope this helps,

federico (who *loves* UPDATE/FROM and window functions)

--
Federico Di Gregorio                         federico.digregorio@dndg.it
Studio Associato Di Nunzio e Di Gregorio                  http://dndg.it
- But what's wrong with your goldfish, orchitis?
- Yes, it has only one eye, a hoarse voice, and eats the other fish.
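Federico's single-statement approach also collapses the Python side to one call. A minimal sketch (the function name is mine; `conn` is assumed to be an open psycopg2 connection):

```python
# Federico's single UPDATE: number each row within its md5sum group and
# flag every row after the first as a duplicate, all in one statement.
DEDUP_SQL = """
UPDATE files_test SET del = T.n > 1
FROM (SELECT recno,
             row_number() OVER (PARTITION BY md5sum) AS n
      FROM files_test) T
WHERE files_test.recno = T.recno;
"""

def flag_duplicates(conn):
    # One round trip, no Python-side loop over duplicate groups.
    cur = conn.cursor()
    cur.execute(DEDUP_SQL)
    conn.commit()
    cur.close()
```

Note that the OVER clause has no ORDER BY, so which row of each duplicate group keeps del = false is arbitrary; adding e.g. `ORDER BY path DESC` inside the OVER clause would reproduce the original script's choice of survivor.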
On 06/18/2012 05:04 AM, Federico Di Gregorio wrote:
> On 17/06/12 17:32, Bill House wrote:
>> I have written a script which does what I want it to do on a table
>> with 999 records.
>>
>> I was wondering if there is anything I need to consider if I run this
>> script on a table with 1.1 million records?
>
> If your recno column is a primary key, you can do this with a single SQL
> UPDATE, without the initial update, temporary table or psycopg (with the
> right indices this is blazing fast even on a gazillion rows):
>
> UPDATE files_test SET del = T.n > 1
>   FROM (SELECT recno,
>                row_number() OVER (PARTITION BY md5sum) AS n
>         FROM files_test) T
>  WHERE files_test.recno = T.recno;
>
> Hope this helps,
>
> federico (who *loves* UPDATE/FROM and window functions)

Thanks for the input, all.

Part of the purpose of this little project is for me to learn about both Python and PostgreSQL; I know very little about either. That is what brings me to psycopg2.

I have read in the documentation that it is recommended that a programmer avoid parameter substitution by any means other than the one supplied by psycopg2. I understand the reasons and will comply if I create code for public access.

I appreciate the opportunity to study the alternative, more efficient approach offered above. As I said, I am new to SQL and have much to learn.

Thanks again,

Bill