Thread: Capacity questions

Capacity questions

From: Bill House
Hello all,

I have written a script which does what I want it to do on a table
with 999 records.

I was wondering if there is anything I need to consider if I run this
script on a table with 1.1 million records?

The purpose of the script is to flag records representing files if they
are redundant (duplicate md5sum).

Later this table will be used to move or remove the files represented by
the flagged records.

The script performs the following actions:

    1 resets the del column to false for all records

    2 creates a server-side cursor over the aggregate records of
identical md5sums having a count greater than 1

    3 iterates down this cursor and, with this information,

    4 updates the del column to True on all records in each batch of
identical md5sums except for one.

Details are below.

Thanks in advance for your advice.

Regards,

Bill


The server hosting PostgreSQL:

    32-bit 2.8 GHz CPU with 2 GB RAM, running openSUSE 12.1.

The workstation executing this script:

    64-bit 3.1 GHz CPU with 12 GB RAM, running openSUSE 12.1.

Table description:

      Table "public.files_test"
 Column |      Type       | Modifiers
--------+-----------------+-----------
 md5sum | character(33)   |
 path   | character(475)  |
 file   | character(200)  |
 del    | boolean         |
 recno  | integer         | not null
Indexes:
    "files_test_ord" btree (recno)

Script:

<CODE>
#!/usr/bin/python
#20120609


import sys
import os

import psycopg2
import psycopg2.extras
import psycopg2.extensions

conn = None


sys.path.append('/mnt/nfs/bb/library_1/stuff_mine/wch/z_projects/computer/z_development/languages/python/z_wch_functions')

import z_all

#z_all.zz_clear()

debug = 0
if debug == 1:
  z_all.zz_set_trace_on()


def main():
  conn = None  #defined up front so the finally clause can test it safely
  #get the connection information
  cxn = z_all.zz_pickle_load('_cxn_bb_wch')

  table = "files_test"

  try:
    #use connection information from the list retrieved from the pickle file
    conn = psycopg2.connect(host= cxn['host'],
                            dbname=cxn['db'],
                            user=cxn['user'],
                            password=cxn['password'])


    print "Resetting all del fields to false."
    working_curs = conn.cursor()
    cmd = "working_curs.execute(\"\"\"UPDATE %s SET del = False;\"\"\")"
% table ; exec cmd
    conn.commit()
    working_curs.close()

    print "Creating walking cursor."
    walking_curs = conn.cursor("walking_curs")
    cmd = "walking_curs.execute(\"\"\"SELECT md5sum, count(md5sum) FROM
%s GROUP BY md5sum HAVING count(md5sum) > 1;\"\"\")" % table ; exec cmd
    row_cnt = 0
    for row in walking_curs:
      mem_md5 = row[0]
      mem_cnt = row[1]
      row_cnt += 1  #count of duplicate md5sum groups seen so far
      print mem_md5, mem_cnt, len(mem_md5)

      print "creating working_cursora"
      working_cursa = conn.cursor()

      print "running command on working_cursora with substitution"
      cmd = "working_cursa.execute(\"\"\"WITH batch AS (SELECT * FROM %s
WHERE md5sum = \'%s\' ORDER BY path DESC OFFSET 1) UPDATE %s SET del =
True FROM batch WHERE batch.md5sum || batch.path = files_test.md5sum ||
files_test.path;\"\"\")" % (table, mem_md5, table) ; exec
cmd


      print "closing working_cursora"
      working_cursa.close()

    walking_curs.close()

    conn.commit()



  except psycopg2.DatabaseError, e:
    print "psycopg2 reports error: {}".format(e)
    sys.exit(1)

  finally:

    if conn:
      conn.close()

if __name__ == '__main__':
  main()





</CODE>



Re: Capacity questions

From: Daniele Varrazzo
On Sun, Jun 17, 2012 at 4:32 PM, Bill House <wch-tech@house-grp.net> wrote:

> I have written a script which does what I want it to do on a table
> with 999 records.
>
> I was wondering if there is anything I need to consider if I run this
> script on a table with 1.1 million records?

It looks fine to me. The only note is that, though I may be wrong,
"UPDATE %s SET del = False" also updates the records whose del is
already false, taking more time and creating unnecessary bloat. Adding
"WHERE del" may help keep the bloat to a minimum.

-- Daniele

Re: Capacity questions

From: Bill House
On 06/17/2012 01:02 PM, Daniele Varrazzo wrote:
> On Sun, Jun 17, 2012 at 4:32 PM, Bill House <wch-tech@house-grp.net> wrote:
>
>> I have written a script which does what I want it to do on a table
>> with 999 records.
>>
>> I was wondering if there is anything I need to consider if I run this
>> script on a table with 1.1 million records?
> It looks fine to me. The only note is that, though I may be wrong,
> "UPDATE %s SET del = False" also updates the records whose del is
> already false, taking more time and creating unnecessary bloat. Adding
> "WHERE del" may help keep the bloat to a minimum.
>
> -- Daniele
>
>
You are referring to the first SQL command, and you are correct. On a
table this size, time is a consideration, even if it is a one-time task.

Thanks very much.

Regards,

Bill

Re: Capacity questions

From: Adrian Klaver
On 06/17/2012 11:11 AM, Bill House wrote:
> On 06/17/2012 01:02 PM, Daniele Varrazzo wrote:
>> On Sun, Jun 17, 2012 at 4:32 PM, Bill House <wch-tech@house-grp.net> wrote:
>>
>>> I have written a script which does what I want it to do on a table
>>> with 999 records.
>>>
>>> I was wondering if there is anything I need to consider if I run this
>>> script on a table with 1.1 million records?
>> It looks fine to me. The only note is that, though I may be wrong,
>> "UPDATE %s SET del = False" also updates the records whose del is
>> already false, taking more time and creating unnecessary bloat. Adding
>> "WHERE del" may help keep the bloat to a minimum.
>>
>> -- Daniele
>>
>>
> You are referring to the first SQL command, and you are correct. On a
> table this size, time is a consideration, even if it is a one-time task.

An additional comment:

You really ought to consider using the parameter passing mechanism built into psycopg2:

http://initd.org/psycopg/docs/usage.html#passing-parameters-to-sql-queries


See here for why that is important:

http://blog.endpoint.com/2012/06/detecting-postgres-sql-injection.html
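
As a hedged sketch (not a drop-in patch), the script's inner UPDATE
might look like this with that mechanism, using the table, mem_md5 and
working_cursa names from the original script: the md5sum value goes
through a placeholder, while the table name, which cannot be passed as
a parameter, is still interpolated, with %% doubled so that the %s
placeholder survives the interpolation:

    cmd = """WITH batch AS (SELECT * FROM %s WHERE md5sum = %%s
                            ORDER BY path DESC OFFSET 1)
             UPDATE %s SET del = True
             FROM batch
             WHERE batch.md5sum || batch.path = %s.md5sum || %s.path;
             """ % (table, table, table, table)
    # psycopg2 quotes mem_md5 itself, closing the injection hole
    working_cursa.execute(cmd, (mem_md5,))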

>
> Thanks very much.
>
> Regards,
>
> Bill
>


--
Adrian Klaver
adrian.klaver@gmail.com



Re: Capacity questions

From: Daniele Varrazzo
On Sun, Jun 17, 2012 at 7:17 PM, Adrian Klaver <adrian.klaver@gmail.com> wrote:

> An additional comment:
>
> You really ought to consider using the parameter passing mechanism built into psycopg2:

Oh, absolutely true. I'd only looked at the first query, where the % was
used to pass a table name, which cannot (yet) be passed as a parameter,
and didn't notice that it was later used for a data value too.
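
A one-line sketch of the distinction, using the names from the original
script:

    # table name: Python string formatting (trusted input only);
    # value: a placeholder filled from execute()'s second argument
    working_curs.execute(
        "SELECT count(*) FROM %s WHERE md5sum = %%s" % table,
        (mem_md5,))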

-- Daniele

Re: Capacity questions

From: Federico Di Gregorio
On 17/06/12 17:32, Bill House wrote:
> I have written a script which does what I want it to do on a table
> with 999 records.
>
> I was wondering if there is anything I need to consider if I run this
> script on a table with 1.1 million records?
>
> The purpose of the script is to flag records representing files if they
> are redundant (duplicate md5sum).
>
> Later this table will be used to move or remove the files represented by
> the flagged records.
>
> The script performs the following actions:
>
>     1 resets the del column to false for all records
>
>     2 creates a server-side cursor over the aggregate records of
> identical md5sums having a count greater than 1
>
>     3 iterates down this cursor and, with this information,
>
>     4 updates the del column to True on all records in each batch of
> identical md5sums except for one.

If your recno column is a primary key, you can do this with a single SQL
UPDATE, without the initial update, temporary table or psycopg (with the
right indices this is blazing fast even on a gazillion rows):

UPDATE files_test SET del = T.n > 1
FROM (SELECT recno,
             row_number() OVER (PARTITION BY md5sum) AS n
        FROM files_test) T
WHERE files_test.recno = T.recno;
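
A hedged sketch of running this from psycopg2; the ORDER BY inside the
window is an addition assumed here to mirror the original script's
ORDER BY path DESC, so that which duplicate survives is deterministic:

    curs = conn.cursor()
    curs.execute("""
        UPDATE files_test SET del = T.n > 1
        FROM (SELECT recno,
                     row_number() OVER (PARTITION BY md5sum
                                        ORDER BY path DESC) AS n
                FROM files_test) T
        WHERE files_test.recno = T.recno;""")
    conn.commit()  # one statement, one commit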

Hope this helps,

federico (that *loves* UPDATE/FROM and window functions)

--
Federico Di Gregorio                         federico.digregorio@dndg.it
Studio Associato Di Nunzio e Di Gregorio                  http://dndg.it
 - What's wrong with your goldfish, orchitis?
 - Yes, it has only one eye, a hoarse voice, and it eats the other fish.

Re: Capacity questions

From: Bill House
On 06/18/2012 05:04 AM, Federico Di Gregorio wrote:
> On 17/06/12 17:32, Bill House wrote:
>> I have written a script which does what I want it to do on a table
>> with 999 records.
>>
>> I was wondering if there is anything I need to consider if I run this
>> script on a table with 1.1 million records?
>>
>> The purpose of the script is to flag records representing files if they
>> are redundant (duplicate md5sum).
>>
>> Later this table will be used to move or remove the files represented by
>> the flagged records.
>>
>> The script performs the following actions:
>>
>>     1 resets the del column to false for all records
>>
>>     2 creates a server-side cursor over the aggregate records of
>> identical md5sums having a count greater than 1
>>
>>     3 iterates down this cursor and, with this information,
>>
>>     4 updates the del column to True on all records in each batch of
>> identical md5sums except for one.
> If your recno column is a primary key, you can do this with a single SQL
> UPDATE, without the initial update, temporary table or psycopg (with the
> right indices this is blazing fast even on a gazillion rows):
>
> UPDATE files_test SET del = T.n > 1
> FROM (SELECT recno,
>              row_number() OVER (PARTITION BY md5sum) AS n
>         FROM files_test) T
> WHERE files_test.recno = T.recno;
>
> Hope this helps,
>
> federico (that *loves* UPDATE/FROM and window functions)
>
Thanks for the input, all.

Part of the purpose of this little project is for me to learn about both
Python and PostgreSQL; I know very little about either.  That is what
brings me to psycopg2.

I have read in the documentation that it is recommended that a
programmer avoid parameter substitution by any means other than the one
supplied by psycopg2.  I understand the reasons and will comply if I
create code for public access.

I appreciate the opportunity to study the alternative, more efficient
approach offered above.  As I said, I am new to SQL and have much to learn.

Thanks again,

Bill