Capacity questions - Mailing list psycopg

From Bill House
Subject Capacity questions
Date
Msg-id 4FDDF88D.9000908@house-grp.net
Whole thread Raw
Responses Re: Capacity questions
Re: Capacity questions
List psycopg
Hello all,

I have a written a script which does what I want it to do on a table
with 999 records.

I was wondering if there is anything I need to consider if I run this
script on a table with 1.1 million records?

The purpose of the script is to flag records representing files if they
are redundant (duplicate md5sum).

Later this table will be used to move or remove the files represented by
the flagged records.

The script performs the following actions

    1 resets the del column  to false for all records

    2 creates a temporary table consisting of aggregate records of
identical md5sums which count more than 1.

    3 iterates down the cursor and with this information

    4 updates the del column to True on all records in a batch of
identical md5sums except for one.

Details are below.

Thanks in advance for your advise.

Regards,

Bill


The server hosting postgresql:

    32 bit 2.8GHz cpu with 2Gb ram running openSUSE 12.1.

The workstation executing this script:

    64 bit 3.1 GHz cpu with 12Gb ram running openSUSE 12.1.

Table description:

      Table "public.files_test"
 Column |      Type       | Modifiers
--------+-----------------+-----------
 md5sum | character(33)   |
 path   | character(475)  |
 file   | character(200)  |
 del    | boolean         |
 recno  | integer         | not null
Indexes:
    "files_test_ord" btree (recno)

Script:

<CODE>
#!/usr/bin/python
#20120609


import sys
import os

import psycopg2
import psycopg2.extras
import psycopg2.extensions

conn = None


sys.path.append('/mnt/nfs/bb/library_1/stuff_mine/wch/z_projects/computer/z_development/languages/python/z_wch_functions')

import z_all

#z_all.zz_clear()

debug = 0
if debug == 1:
  z_all.zz_set_trace_on()


def main():
  #get the connection information
  cxn = z_all.zz_pickle_load('_cxn_bb_wch')

  table = "files_test"

  try:
    #use connection information from the list retrieved from the pickle file
    conn = psycopg2.connect(host= cxn['host'],
                            dbname=cxn['db'],
                            user=cxn['user'],
                            password=cxn['password'])


    print "Resetting all del fields to false."
    working_curs = conn.cursor()
    cmd = "working_curs.execute(\"\"\"UPDATE %s SET del = False;\"\"\")"
% table ; exec cmd
    conn.commit()
    working_curs.close()

    print "Creating walking cursor."
    walking_curs = conn.cursor("walking_curs")
    cmd = "walking_curs.execute(\"\"\"SELECT md5sum, count(md5sum) FROM
%s GROUP BY md5sum HAVING count(md5sum) > 1;\"\"\")" % table ; exec cmd
    for row in walking_curs:
      mem_md5    = row[0]
      mem_cnt    = row[1]
      row_cnt =+ 1
      print mem_md5, mem_cnt, len(mem_md5)

      print "creating working_cursora"
      working_cursa = conn.cursor()

      print "running command on working_cursora with substitution"
      cmd = "working_cursa.execute(\"\"\"WITH batch AS (SELECT * FROM %s
WHERE md5sum = \'%s\' ORDER BY path DESC OFFSET 1) UPDATE %s SET del =
True FROM batch WHERE batch.md5sum || batch.path = files_test.md5sum ||
files_test.path;\"\"\")" % (table, mem_md5, table) ; exec
cmd


      print "closing working_cursora"
      working_cursa.close()

    walking_curs.close()

    conn.commit()



  except psycopg2.DatabaseError, e:
    print "psycopg2 reports error: {}".format(e)
    sys.exit(1)

  finally:

    if conn:
      conn.close()
    return

if __name__ == '__main__':
  main()





</CODE>



psycopg by date:

Previous
From: Marc Abramowitz
Date:
Subject: Patches to add support for tox and Travis CI tools to psycopg2
Next
From: "W. Matthew Wilson"
Date:
Subject: When I select a single column, can I prevent getting a list of one-element tuples?