Capacity questions - Mailing list psycopg

From: Bill House
Subject: Capacity questions
Date:
Msg-id: 4FDDF88D.9000908@house-grp.net
Responses: Re: Capacity questions; Re: Capacity questions
List: psycopg
Hello all,

I have written a script which does what I want it to do on a table with 999 records. I was wondering whether there is anything I need to consider before running this script on a table with 1.1 million records?

The purpose of the script is to flag records representing files as redundant (duplicate md5sum). Later this table will be used to move or remove the files represented by the flagged records.

The script performs the following actions:
1. resets the del column to False for all records
2. opens a server-side ("walking") cursor over aggregate rows of identical md5sums occurring more than once
3. iterates down that cursor, and with this information
4. updates the del column to True on all records in a batch of identical md5sums except for one.

Details are below. Thanks in advance for your advice.

Regards,
Bill

The server hosting PostgreSQL: 32-bit 2.8 GHz CPU with 2 GB RAM running openSUSE 12.1.
The workstation executing this script: 64-bit 3.1 GHz CPU with 12 GB RAM running openSUSE 12.1.

Table description:

        Table "public.files_test"
 Column |      Type       | Modifiers
--------+-----------------+-----------
 md5sum | character(33)   |
 path   | character(475)  |
 file   | character(200)  |
 del    | boolean         |
 recno  | integer         | not null
Indexes:
    "files_test_ord" btree (recno)

Script:

<CODE>
#!/usr/bin/python
#20120609
import sys
import os
import psycopg2
import psycopg2.extras
import psycopg2.extensions

sys.path.append('/mnt/nfs/bb/library_1/stuff_mine/wch/z_projects/computer/z_development/languages/python/z_wch_functions')
import z_all
#z_all.zz_clear()

debug = 0
if debug == 1:
    z_all.zz_set_trace_on()

def main():
    # get the connection information
    cxn = z_all.zz_pickle_load('_cxn_bb_wch')
    table = "files_test"
    conn = None  # so the finally clause is safe if connect() fails
    try:
        # use connection information from the list retrieved from the pickle file
        conn = psycopg2.connect(host=cxn['host'], dbname=cxn['db'],
                                user=cxn['user'], password=cxn['password'])

        print "Resetting all del fields to false."
        working_curs = conn.cursor()
        # a table name cannot be a bound parameter, so it is interpolated
        # into the SQL string directly
        working_curs.execute("UPDATE %s SET del = False;" % table)
        conn.commit()
        working_curs.close()

        print "Creating walking cursor."
        walking_curs = conn.cursor("walking_curs")  # named (server-side) cursor
        walking_curs.execute("""SELECT md5sum, count(md5sum) FROM %s
                                GROUP BY md5sum
                                HAVING count(md5sum) > 1;""" % table)
        row_cnt = 0
        for row in walking_curs:
            mem_md5 = row[0]
            mem_cnt = row[1]
            row_cnt += 1
            print mem_md5, mem_cnt, len(mem_md5)

            print "creating working_cursa"
            working_cursa = conn.cursor()
            print "running command on working_cursa with substitution"
            # keep one row per batch of identical md5sums and flag the rest;
            # the md5sum value is passed as a bound parameter (%%s escapes
            # the placeholder through the table-name interpolation)
            working_cursa.execute("""WITH batch AS
                                         (SELECT * FROM %s WHERE md5sum = %%s
                                          ORDER BY path DESC OFFSET 1)
                                     UPDATE %s SET del = True FROM batch
                                     WHERE batch.md5sum || batch.path =
                                           %s.md5sum || %s.path;"""
                                  % (table, table, table, table), (mem_md5,))
            print "closing working_cursa"
            working_cursa.close()

        walking_curs.close()
        conn.commit()
    except psycopg2.DatabaseError, e:
        print "psycopg2 reports error: {}".format(e)
        sys.exit(1)
    finally:
        if conn:
            conn.close()
    return

if __name__ == '__main__':
    main()
</CODE>
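For what it's worth, the core of steps 2-4 — keep one row per duplicate md5sum and flag the rest — can be sketched in plain Python, independent of the database. This is only an illustration of the intended logic (the function name and sample data are mine); the keep-the-highest-path rule mirrors the SQL's ORDER BY path DESC OFFSET 1:

```python
def rows_to_flag(records):
    """Given (md5sum, path) pairs, return the set of paths to flag as
    redundant: every row in a duplicate-md5sum batch except one.
    The row whose path sorts highest (descending) is the one kept."""
    by_md5 = {}
    for md5sum, path in records:
        by_md5.setdefault(md5sum, []).append(path)
    flagged = set()
    for paths in by_md5.values():
        if len(paths) > 1:
            # sort descending, skip the first (kept) row, flag the rest
            flagged.update(sorted(paths, reverse=True)[1:])
    return flagged

records = [
    ("aaa", "/x/1"), ("aaa", "/x/2"), ("aaa", "/x/3"),
    ("bbb", "/y/1"),  # unique md5sum: never flagged
]
print(sorted(rows_to_flag(records)))  # ['/x/1', '/x/2']
```
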