Capacity questions - Mailing list psycopg

From: Bill House
Subject: Capacity questions
Date:
Msg-id: 4FDDF88D.9000908@house-grp.net
Responses: Re: Capacity questions; Re: Capacity questions
List: psycopg
Hello all,

I have written a script which does what I want it to do on a table with 999 records. I was wondering whether there is anything I need to consider before running this script on a table with 1.1 million records?

The purpose of the script is to flag records representing files as redundant (duplicate md5sum). Later this table will be used to move or remove the files represented by the flagged records.

The script performs the following actions:
1. resets the del column to False for all records
2. opens a server-side ("walking") cursor over aggregate rows of identical md5sums occurring more than once
3. iterates down that cursor, and with this information
4. updates the del column to True on all records in a batch of identical md5sums except for one.

Details are below. Thanks in advance for your advice.

Regards,
Bill

The server hosting PostgreSQL: 32-bit 2.8 GHz CPU with 2 GB RAM running openSUSE 12.1.
The workstation executing this script: 64-bit 3.1 GHz CPU with 12 GB RAM running openSUSE 12.1.

Table description:

        Table "public.files_test"
 Column |      Type       | Modifiers
--------+-----------------+-----------
 md5sum | character(33)   |
 path   | character(475)  |
 file   | character(200)  |
 del    | boolean         |
 recno  | integer         | not null
Indexes:
    "files_test_ord" btree (recno)

Script:

<CODE>
#!/usr/bin/python
#20120609
import sys
import os
import psycopg2
import psycopg2.extras
import psycopg2.extensions

sys.path.append('/mnt/nfs/bb/library_1/stuff_mine/wch/z_projects/computer/z_development/languages/python/z_wch_functions')
import z_all
#z_all.zz_clear()

debug = 0
if debug == 1:
    z_all.zz_set_trace_on()

def main():
    # get the connection information
    cxn = z_all.zz_pickle_load('_cxn_bb_wch')
    table = "files_test"
    conn = None  # so the finally clause is safe if connect() fails
    try:
        # use connection information from the list retrieved from the pickle file
        conn = psycopg2.connect(host=cxn['host'], dbname=cxn['db'],
                                user=cxn['user'], password=cxn['password'])

        print "Resetting all del fields to false."
        working_curs = conn.cursor()
        # a table name cannot be a bound parameter, so it is interpolated
        # into the SQL string directly
        working_curs.execute("UPDATE %s SET del = False;" % table)
        conn.commit()
        working_curs.close()

        print "Creating walking cursor."
        walking_curs = conn.cursor("walking_curs")  # named (server-side) cursor
        walking_curs.execute("""SELECT md5sum, count(md5sum) FROM %s
                                GROUP BY md5sum
                                HAVING count(md5sum) > 1;""" % table)
        row_cnt = 0
        for row in walking_curs:
            mem_md5 = row[0]
            mem_cnt = row[1]
            row_cnt += 1
            print mem_md5, mem_cnt, len(mem_md5)

            print "creating working_cursa"
            working_cursa = conn.cursor()
            print "running command on working_cursa with substitution"
            # keep one row per batch of identical md5sums and flag the rest;
            # the md5sum value is passed as a bound parameter (%%s escapes
            # the placeholder through the table-name interpolation)
            working_cursa.execute("""WITH batch AS
                                         (SELECT * FROM %s WHERE md5sum = %%s
                                          ORDER BY path DESC OFFSET 1)
                                     UPDATE %s SET del = True FROM batch
                                     WHERE batch.md5sum || batch.path =
                                           %s.md5sum || %s.path;"""
                                  % (table, table, table, table), (mem_md5,))
            print "closing working_cursa"
            working_cursa.close()

        walking_curs.close()
        conn.commit()
    except psycopg2.DatabaseError, e:
        print "psycopg2 reports error: {}".format(e)
        sys.exit(1)
    finally:
        if conn:
            conn.close()
    return

if __name__ == '__main__':
    main()
</CODE>
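For what it's worth, the core of steps 2-4 — keep one row per duplicate md5sum and flag the rest — can be sketched in plain Python, independent of the database. This is only an illustration of the intended logic (the function name and sample data are mine); the keep-the-highest-path rule mirrors the SQL's ORDER BY path DESC OFFSET 1:

```python
def rows_to_flag(records):
    """Given (md5sum, path) pairs, return the set of paths to flag as
    redundant: every row in a duplicate-md5sum batch except one.
    The row whose path sorts highest (descending) is the one kept."""
    by_md5 = {}
    for md5sum, path in records:
        by_md5.setdefault(md5sum, []).append(path)
    flagged = set()
    for paths in by_md5.values():
        if len(paths) > 1:
            # sort descending, skip the first (kept) row, flag the rest
            flagged.update(sorted(paths, reverse=True)[1:])
    return flagged

records = [
    ("aaa", "/x/1"), ("aaa", "/x/2"), ("aaa", "/x/3"),
    ("bbb", "/y/1"),  # unique md5sum: never flagged
]
print(sorted(rows_to_flag(records)))  # ['/x/1', '/x/2']
```
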