Thread: New to the list; would this be an okay question?

New to the list; would this be an okay question?

From: Madison Kelly
Hi all,

   I am new to the list and I didn't want to seem rude at all so I
wanted to ask if this was okay first.

   I have a program I have written in Perl which uses a PostgreSQL
database as the backend. The program works, but the performance is really
bad. I have been reading as much as I can on optimizing performance, but
it still isn't reasonable. At one point I had my program able to
process 175,000 records in 16min 10sec on a Pentium3 650MHz, 448MB RAM
test machine. Since then I got a Pentium3 1GHz, 512MB system and I have
tried a lot of things to get the performance up, but now it is
substantially slower and I can't seem to figure out what I am doing wrong.

   Would it be appropriate to ask for help on my program on this list?
Full disclosure: The program won't be initially GPL'ed because it is for
my company but it will be released for free to home users and the source
code will be made available (similar to other split-license programs)
though once my company makes its money back I think they will fully GPL
it (I am on my boss's case about it :p ).

   Thanks all!

Madison Kelly

Re: New to the list; would this be an okay question?

From: Martijn van Oosterhout
Standard questions:
- Have you VACUUMed?
- Have you VACUUM ANALYZEd?
- Have you done EXPLAIN ANALYZE on the complex queries?
- Have you put INDEXes on the appropriate columns?

You need to give more details if you want more detailed answers.
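
For instance, all four can be driven straight from Perl/DBI (a rough
sketch only; the database, table, column and index names here are
placeholders, not from your program):

use DBI;

# Placeholder connection; adjust dbname/user to your setup
my $dbh = DBI->connect("DBI:Pg:dbname=mydb", "someuser", "",
                       { RaiseError => 1, AutoCommit => 1 });

# VACUUM / VACUUM ANALYZE: reclaim dead rows and refresh planner stats
$dbh->do("VACUUM ANALYZE");

# INDEX the columns your WHERE clauses actually filter on
$dbh->do("CREATE INDEX some_idx ON some_table (col_a, col_b)");

# EXPLAIN ANALYZE: returns one line of query plan per row
my $plan = $dbh->selectcol_arrayref(
    "EXPLAIN ANALYZE SELECT * FROM some_table WHERE col_a = 'x'");
print "$_\n" for @$plan;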

On Mon, Jun 21, 2004 at 09:38:14AM -0400, Madison Kelly wrote:
> Hi all,
>
>   I am new to the list and I didn't want to seem rude at all so I
> wanted to ask if this was okay first.
>
>   I have a program I have written in perl which uses a postgresSQL
> database as the backend. The program works but the performance is really
> bad. I have been reading as much as I can on optimizing performance but
> still it isn't very reasonable. At one point I had my program able to
> process 175,000 records in 16min 10sec on a Pentium3 650MHz, 448MB RAM
> test machine. Since then I got a Pentium3 1GHz, 512MB system and I have
> tried a lot of things to get the performance up but now it is
> substantially slower and I can't seem to figure out what I am doing wrong.
>
>   Would it be appropriate to ask for help on my program on this list?
> Full disclosure: The program won't be initially GPL'ed because it is for
> my company but it will be released for free to home users and the source
> code will be made available (similar to other split-license programs)
> though once my company makes it's money back I think they will fully GPL
> it (I am on my boss's case about it :p ).
>
>   Thanks all!
>
> Madison Kelly
>

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: New to the list; would this be an okay question?

From: Richard Huxton
Madison Kelly wrote:
> Hi all,
>
>   I am new to the list and I didn't want to seem rude at all so I wanted
> to ask if this was okay first.

No problem. Reading your message below, you might want to try the
performance list, but general is a good place to start.

>   I have a program I have written in perl which uses a postgresSQL
> database as the backend. The program works but the performance is really
> bad. I have been reading as much as I can on optimizing performance but
> still it isn't very reasonable. At one point I had my program able to
> process 175,000 records in 16min 10sec on a Pentium3 650MHz, 448MB RAM
> test machine. Since then I got a Pentium3 1GHz, 512MB system and I have
> tried a lot of things to get the performance up but now it is
> substantially slower and I can't seem to figure out what I am doing wrong.

A few places to start:
1. VACUUM FULL
    This will make sure any unused space is reclaimed
2. ANALYZE
    This will recalculate stats for the tables
3. Basic performance tuning:
    http://www.varlena.com/varlena/GeneralBits/Tidbits/index.php
    There's also a good guide to the postgresql.conf file on varlena.com

>   Would it be appropriate to ask for help on my program on this list?
> Full disclosure: The program won't be initially GPL'ed because it is for
> my company but it will be released for free to home users and the source
> code will be made available (similar to other split-license programs)
> though once my company makes it's money back I think they will fully GPL
> it (I am on my boss's case about it :p ).

No problem - what you licence your software under is your concern. Once
you've taken the basic steps described above, try to pick out a specific
query that you think is too slow and provide:

1. PostgreSQL version
2. Basic hardware info (as you have)
3. Sizes of tables (a quick way to grab these is sketched below)
4. Output of EXPLAIN ANALYZE <query here>

The EXPLAIN ANALYZE runs the query and shows how much work PG thought it
would be and how much it actually turned out to be.
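
For point 3, if you don't have the table sizes handy, the planner's own
statistics give a rough idea. A sketch (assuming a DBI handle $dbh;
substitute your real table name, and the numbers are only as fresh as the
last ANALYZE):

my ($pages, $tuples) = $dbh->selectrow_array(
    "SELECT relpages, reltuples FROM pg_class WHERE relname = 'your_table'");
print "your_table: ~$tuples rows in $pages disk pages (8KB each)\n";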

HTH
--
   Richard Huxton
   Archonet Ltd

Re: New to the list; would this be an okay question?

From: Madison Kelly
Sorry; I didn't include details at first because I wanted to make sure
that it was an appropriate request for this list.

I have the program run 'VACUUM ANALYZE' after every major update/insert
job and I have in fact indexed the three columns that I search through
when I need to decide to update (if the record exists) or insert (if it
does not). I have read about the EXPLAIN option/tool but I haven't been
able to get my head around how to properly use it yet.

Here is what I am trying to do:

The program is a Linux backup program that uses a web front-end as its
interface (so that a client can access it from any system, inc. MS
workstations). The program is also designed to allow the user to search
for a given file by name or by spec (file size, date modified, etc.) on
media that is offline. The program is designed around externally connected
USB2 and FireWire drives, so it is all partition-based.

In order to make the web front-end stateful and to allow searching, I
needed to keep detailed information in the database on every file and
directory on a given partition. Some of the information also needs to be
maintained, so I can't just clear and rescan a partition. For example, I
need to keep track of which directories and files a user has selected or
not selected to be backed up on a given partition. This means that
whenever I need to update the contents of a partition I need to run 'ls'
starting at the mount point for the partition and scan down through all
subdirectories.
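
Just to illustrate the kind of walk I mean (not my actual code; a
throwaway sketch using File::Find instead of my real recursive subroutine):

use File::Find;

my $mount_point = '/mnt/usb1';    # example mount point only
find(sub {
    my $full_path  = $File::Find::name;
    my $parent_dir = $File::Find::dir;
    my ($size, $mtime) = (stat($full_path))[7, 9];
    # ...this is where I check the database and decide whether to
    # UPDATE the existing record or INSERT a new one...
}, $mount_point);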

As each file is scanned I check the database to see if the file name I
am looking at already exists there. I do this by searching on
file_name, file_parent_dir (parent directory) and file_src_uuid (the
UUID [serial number] of the partition the file is on). If there is a
match I run an "UPDATE" that does not touch the backup state. If the
file is new then the file is added to the database along with all of
its particular information such as owning user, group, permissions,
file size and so on.

Given that some file systems have 250,000 files and directories I need
to make sure that the database calls are as optimized as I can make
them. I have verified that the lag is in the database by commenting out
the actual database calls and letting the program traverse the file
system. In that case a job that took nearly 200 seconds with the database
calls in place finishes in roughly 2 seconds.

If this is an okay request I would be happy to post the schema I am
using and the perl code I am using to make the DB calls.

Thanks!!

Madison Kelly

Martijn van Oosterhout wrote:
> Standard questions:
> - Have you VACUUMed?
> - Have you VACUUM ANALYZEd?
> - Have you done EXPLAIN ANALYZE on the complex queries?
> - Have you put INDEXes on the appropriate columns.
>
> You need to give more details is you want more detailed answers.
>
> On Mon, Jun 21, 2004 at 09:38:14AM -0400, Madison Kelly wrote:
>
>>Hi all,
>>
>>  I am new to the list and I didn't want to seem rude at all so I
>>wanted to ask if this was okay first.
>>
>>  I have a program I have written in perl which uses a postgresSQL
>>database as the backend. The program works but the performance is really
>>bad. I have been reading as much as I can on optimizing performance but
>>still it isn't very reasonable. At one point I had my program able to
>>process 175,000 records in 16min 10sec on a Pentium3 650MHz, 448MB RAM
>>test machine. Since then I got a Pentium3 1GHz, 512MB system and I have
>>tried a lot of things to get the performance up but now it is
>>substantially slower and I can't seem to figure out what I am doing wrong.
>>
>>  Would it be appropriate to ask for help on my program on this list?
>>Full disclosure: The program won't be initially GPL'ed because it is for
>>my company but it will be released for free to home users and the source
>>code will be made available (similar to other split-license programs)
>>though once my company makes it's money back I think they will fully GPL
>>it (I am on my boss's case about it :p ).
>>
>>  Thanks all!
>>
>>Madison Kelly
>>
>
>


Re: New to the list; would this be an okay question?

From: Madison Kelly
Richard Huxton wrote:
> Madison Kelly wrote:
>
>> Hi all,
>>
>>   I am new to the list and I didn't want to seem rude at all so I
>> wanted to ask if this was okay first.
>
>
> No problem. Reading your message below, you might want to try the
> performance list, but general is a good place to start.
>
>>   I have a program I have written in perl which uses a postgresSQL
>> database as the backend. The program works but the performance is
>> really bad. I have been reading as much as I can on optimizing
>> performance but still it isn't very reasonable. At one point I had my
>> program able to process 175,000 records in 16min 10sec on a Pentium3
>> 650MHz, 448MB RAM test machine. Since then I got a Pentium3 1GHz,
>> 512MB system and I have tried a lot of things to get the performance
>> up but now it is substantially slower and I can't seem to figure out
>> what I am doing wrong.
>
>
> A few places to start:
> 1. VACUUM FULL
>    This will make sure any unused space is reclaimed
> 2. ANALYZE
>    This will recalculate stats for the tables
> 3. Basic performce tuning:
>    http://www.varlena.com/varlena/GeneralBits/Tidbits/index.php
>    There's also a good guide to the postgresql.conf file on varlena.com
>
>>   Would it be appropriate to ask for help on my program on this list?
>> Full disclosure: The program won't be initially GPL'ed because it is
>> for my company but it will be released for free to home users and the
>> source code will be made available (similar to other split-license
>> programs) though once my company makes it's money back I think they
>> will fully GPL it (I am on my boss's case about it :p ).
>
>
> No problem - what you licence your software under is your concern. Once
> you've taken the basic steps described above, try to pick out a specific
> query that you think is too slow and provide:
>
> 1. PostgreSQL version
> 2. Basic hardware info (as you have)
> 3. Sizes of tables.
> 4. Output of EXPLAIN ANALYZE <query here>
>
> The EXPLAIN ANALYZE runs the query and shows how much work PG thought it
> would be and how much it actually turned out to be.
>
> HTH

    Thank you very much!! I am using PostgreSQL 7.4 on a stock install of
Fedora Core 2 on my IBM ThinkPad A22m (P3 1GHz, 512MB RAM, not the fastest
HDD). The drive carrier I am using is connected via USB2 and uses a few
different hard drives with the fastest being a couple of Barracuda
7200.7 drives (2MB cache, 7,200rpm). I described the program in my reply
to Martijn, so here is some of the code (code not related to the database
snipped; let me know if posting it would help - sorry for the wrapping...):

  =-[ Calling the database ]-=
# Open the connection to the database
my $DB = DBI->connect("DBI:Pg:dbname=$db_name", "$user")
    || die("Connect error (Is PostgreSQL running?): $DBI::errstr");

# Prepare the statements before using them for speed:
$select_sth = $DB->prepare("SELECT null FROM file_dir WHERE
    file_src_uuid=? AND file_parent_dir=? AND file_name=?")
    || die "$DBI::errstr";
$select_up = $DB->prepare("UPDATE file_dir SET file_perm=?,
    file_own_user=?, file_own_grp=?, file_size=?, file_mod_date=?,
    file_mod_time=?, file_mod_time_zone=?, file_exist=? WHERE
    file_src_uuid=? AND file_parent_dir=? AND file_name=?")
    || die "$DBI::errstr";
$select_in = $DB->prepare("INSERT INTO file_dir ( file_src_uuid,
    file_name, file_dir, file_parent_dir, file_perm, file_own_user,
    file_own_grp, file_size, file_mod_date, file_mod_time,
    file_mod_time_zone, file_backup, file_restore, file_display,
    file_exist ) VALUES ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )")
    || die "$DBI::errstr";

# Set the 'file_exist' flag to 'false' and reset existing files to 'true'.
$DB->do("UPDATE file_dir SET file_exist='f' WHERE
    file_src_uuid='$file_src_uuid'") || die "$DBI::errstr";

# Start scanning the drive
$num = $select_sth->execute($file_src_uuid, $relative_dir, $file_name)
    || die "$DBI::errstr";
if ( $num > 0 )
{
    $select_up->execute($file_perm, $file_own_user, $file_own_grp,
        $file_size, $file_mod_date, $file_mod_time, $file_mod_time_zone,
        $file_exist, $file_src_uuid, $file_parent_dir, $file_name)
        || die "$DBI::errstr";
}
else
{
    $select_in->execute($file_src_uuid, $file_name, $file_dir,
        $file_parent_dir, $file_perm, $file_own_user, $file_own_grp,
        $file_size, $file_mod_date, $file_mod_time, $file_mod_time_zone,
        $file_backup, $file_restore, $file_display, $file_exist)
        || die "$DBI::errstr";
}

# We need to grab the existing file settings for the special file '/.'
$DBreq = $DB->prepare("SELECT file_backup, file_restore, file_display FROM
    file_dir WHERE file_parent_dir='/' AND file_name='.' AND
    file_src_uuid='$file_src_uuid'") || die $DBI::errstr;
$DBreq->execute() || die $DBI::errstr;
@file_backup  = $DBreq->fetchrow_array();
$file_backup  = $file_backup[0];
$file_restore = $file_backup[1];
$file_display = $file_backup[2];

# Jump into the re-entrant subroutine to scan directories and sub-dirs
&list_files($real_dir, $exclude_list_num, $relative_dir, $file_backup,
$file_restore, $file_display);

# Inside the subroutine

# Does the directory/file/symlink already exist? (there are three of
# these blocks, one for each file type)
$num = $select_sth->execute($file_src_uuid, $relative_dir, $file_name)
    || die "$DBI::errstr";
if ( $num > 0 )
{
    $select_up->execute($file_perm, $file_own_user, $file_own_grp,
        $file_size, $file_mod_date, $file_mod_time, $file_mod_time_zone,
        $file_exist, $file_src_uuid, $file_parent_dir, $file_name)
        || die "$DBI::errstr";
}
else
{
    # The file did not exist so we will use the passed parent settings
    # for the 'file_backup' flag and leave the 'file_display' flag set to 'f'
    $select_in->execute($file_src_uuid, $file_name, $file_dir,
        $file_parent_dir, $file_perm, $file_own_user, $file_own_grp,
        $file_size, $file_mod_date, $file_mod_time, $file_mod_time_zone,
        $file_backup, $file_restore, $file_display, $file_exist)
        || die "$DBI::errstr";
}

# If this was a file I would loop and process the next file in the
# directory; if it was a directory itself I would now re-enter the
# subroutine to process its contents, and when I fell back I would pick
# up where I left off

# Returning from the final subroutine and finishing up

$DB->do("VACUUM ANALYZE");

  =-[ finished DB related source code ]-=

Here is the schema for the 'file_dir' table which I hit repeatedly here:

  =-[ file_dir table and index schemas ]-=

CREATE TABLE file_dir (                            -- Used to store info on every file on source partitions
    file_id             serial        unique,      -- make this 'bigserial' if there may be more than
                                                   -- 2 billion files in the database
    file_src_uuid       varchar(40)   not null,    -- the UUID of the source partition hosting the
                                                   -- original file
    file_org_uuid       varchar(40),               -- the UUID that the file came from (when the file
                                                   -- was moved by TLE-BU)
    file_name           varchar(255)  not null,    -- Name of the file or directory
    file_dir            bool          not null,    -- t = is directory, f = file
    file_parent_dir     varchar(255)  not null,    -- if directory '/foo/bar', parent is '/foo'; if file
                                                   -- '/foo/bar/file', parent is '/foo/bar'. The mount
                                                   -- directory is treated as '/' so any directories
                                                   -- below it will be ignored for this record.
    file_perm           varchar(10)   not null,    -- file or directory permissions
    file_own_user       varchar(255)  not null,    -- The file's owning user (by name, not UID!!)
    file_own_grp        varchar(255)  not null,    -- The file's owning group (by name, not GID!!)
    file_size           bigint        not null,    -- File size in bytes
    file_mod_date       varchar(12)   not null,    -- File's last edited date
    file_mod_time       varchar(20)   not null,    -- File's last edited time
    file_mod_time_zone  varchar(6)    not null,    -- File's last edited time zone
    file_backup         boolean       not null default 'f',  -- 't' = Include in backup jobs,
                                                   -- 'f' = Do not include in backup jobs
    file_restore        boolean       not null default 'f',  -- 't' = Include in restore jobs,
                                                   -- 'f' = Do not include in restore jobs
    file_display        boolean       not null default 'f',  -- 't' = display, 'f' = hide
    file_exist          boolean       default 't'  -- Used to catch files that have been deleted since
                                                   -- the last scan. Before rescan, all files in a given
                                                   -- src_uuid are set to 0 (deleted) and then as each
                                                   -- file is found or updated it is reset back to 1
                                                   -- (exists); anything left with a value of '0' at the
                                                   -- end of the scan is deleted and we will remove its
                                                   -- record.
);

-- CREATE INDEX file_dir_idx ON file_dir (file_src_uuid, file_name, file_parent_dir);

  =-[ Finish file_dir table and index schemas ]-=

Thanks so much!!

Madison


Re: New to the list; would this be an okay question?

From: Richard Huxton
Madison Kelly wrote:
> Richard Huxton wrote:
>
>> Madison Kelly wrote:
>>
>>> Hi all,
>>>
>>>   I am new to the list and I didn't want to seem rude at all so I
>>> wanted to ask if this was okay first.
>>
>>
>>
>> No problem. Reading your message below, you might want to try the
>> performance list, but general is a good place to start.
>>
>>>   I have a program I have written in perl which uses a postgresSQL
>>> database as the backend. The program works but the performance is
>>> really bad. I have been reading as much as I can on optimizing
>>> performance but still it isn't very reasonable. At one point I had my
>>> program able to process 175,000 records in 16min 10sec on a Pentium3
>>> 650MHz, 448MB RAM test machine. Since then I got a Pentium3 1GHz,
>>> 512MB system and I have tried a lot of things to get the performance
>>> up but now it is substantially slower and I can't seem to figure out
>>> what I am doing wrong.
>>
>>
>>
>> A few places to start:
>> 1. VACUUM FULL
>>    This will make sure any unused space is reclaimed
>> 2. ANALYZE
>>    This will recalculate stats for the tables
>> 3. Basic performce tuning:
>>    http://www.varlena.com/varlena/GeneralBits/Tidbits/index.php
>>    There's also a good guide to the postgresql.conf file on varlena.com
>>
>>>   Would it be appropriate to ask for help on my program on this list?
>>> Full disclosure: The program won't be initially GPL'ed because it is
>>> for my company but it will be released for free to home users and the
>>> source code will be made available (similar to other split-license
>>> programs) though once my company makes it's money back I think they
>>> will fully GPL it (I am on my boss's case about it :p ).
>>
>>
>>
>> No problem - what you licence your software under is your concern.
>> Once you've taken the basic steps described above, try to pick out a
>> specific query that you think is too slow and provide:
>>
>> 1. PostgreSQL version
>> 2. Basic hardware info (as you have)
>> 3. Sizes of tables.
>> 4. Output of EXPLAIN ANALYZE <query here>
>>
>> The EXPLAIN ANALYZE runs the query and shows how much work PG thought
>> it would be and how much it actually turned out to be.
>>
>> HTH
>
>
>     Thank you very much!! I am using Psql 7.4 on a stock install of
> Fedora Core 2 on my IBM thinkpad a22m (P3 1GHz, 512MB RAM, not the
> fastest HDD). The drive carrier I am using is connected via USB2 and
> uses a few different hard drives with the fastest being a couple of
> Barracuda 7200.7 drives (2MB cache, 7,200rpm). I described the program
> in my reply to Martijn so here is some of the code (code not related to
> psql snipped, let me know if posting it would help - sorry for the
> wrapping...):

I'm not clear if the database is on the local disk or attached to the
USB2. Not sure it's important, since neither will be that fast.

If I understand, you scan thousands or millions of files for backup
purposes and then issue a select + update/insert for each.

Once a partition is scanned, a flag is cleared on all rows.

Once all selected files have been dealt with, a vacuum/analyse is issued.


Some things to look at:
1. How many files are you handling per second? Are the disks involved in
the backup as well as the database?
2. What does the output of "vmstat 10" show when the system is running?
Is your I/O saturated? CPU?
3. Is your main index (file_src_uuid,file_name,file_parent_dir) being
used? Your best bet is to select from "pg_stat_user_indexes" before and after.
4. If you are updating several hundred thousand rows then you probably
don't have enough vacuum memory set aside - try a vacuum full after
each set of updates.
5. You might want to batch together queries into transactions of a few
hundred or even a few thousand updates.
--
   Richard Huxton
   Archonet Ltd

Re: New to the list; would this be an okay question?

From: Christopher Browne
After a long battle with technology, dev@archonet.com (Richard Huxton), an earthling, wrote:
> 5. You might want to batch together queries into transactions of a few
> hundred or even few thousand updates.

When this particular application got discussed on a local LUG mailing
list, this emerged as being one of the factors most likely to be a Big
Deal.
--
If this was helpful, <http://svcs.affero.net/rm.php?r=cbbrowne> rate me
http://www3.sympatico.ca/cbbrowne/lsf.html
"A   hack  is a   terrible   thing  to  waste,    please  give to  the
implementation of your choice..." -- GJC

Re: New to the list; would this be an okay question?

From: Richard Huxton
Christopher Browne wrote:
> After a long battle with technology, dev@archonet.com (Richard Huxton), an earthling, wrote:
>
>>5. You might want to batch together queries into transactions of a few
>>hundred or even few thousand updates.
>
>
> When this particular application got discussed on local LUG mailing
> list, this emerged as being one of the factors most likely to be a Big
> Deal.

Yep, except... Madison said a laptop was involved, so I'm guessing it's
an IDE drive lying about sync-ing. If fsync is effectively off that
shouldn't have such a huge effect should it?

--
   Richard Huxton
   Archonet Ltd

Re: New to the list; would this be an okay question?

From: Martijn van Oosterhout
On Mon, Jun 21, 2004 at 08:29:54PM +0100, Richard Huxton wrote:
> Christopher Browne wrote:
> >When this particular application got discussed on local LUG mailing
> >list, this emerged as being one of the factors most likely to be a Big
> >Deal.
>
> Yep, except... Madison said a laptop was involved, so I'm guessing it's
> an IDE drive lying about sync-ing. If fsync is effectively off that
> shouldn't have such a huge effect should it?

The IDE drive lying about syncing is different from fsync being turned
off. What the drive thinks doesn't matter until after Postgres has
written the WAL, closed the transaction and written the pages out. The
fsync will still cause Linux to wait for all the data to be written to
the disk, which still takes a finite amount of time; the disk buffer is
only a few MB. Turning fsync off means Linux will never wait, just
buffer in system memory. Similarly, putting it all in one transaction
means that within the transaction there is no waiting, only at
transaction commit.

With fsync on, being in or out of a transaction can make a really big
difference.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: New to the list; would this be an okay question?

From: Madison Kelly
Richard Huxton wrote:
> Madison Kelly wrote:
>
>> Richard Huxton wrote:
>>
>>> Madison Kelly wrote:
>>>
>>>> Hi all,
>>>>
>>>>   I am new to the list and I didn't want to seem rude at all so I
>>>> wanted to ask if this was okay first.
>>>
>>>
>>>
>>>
>>> No problem. Reading your message below, you might want to try the
>>> performance list, but general is a good place to start.
>>>
>>>>   I have a program I have written in perl which uses a postgresSQL
>>>> database as the backend. The program works but the performance is
>>>> really bad. I have been reading as much as I can on optimizing
>>>> performance but still it isn't very reasonable. At one point I had
>>>> my program able to process 175,000 records in 16min 10sec on a
>>>> Pentium3 650MHz, 448MB RAM test machine. Since then I got a Pentium3
>>>> 1GHz, 512MB system and I have tried a lot of things to get the
>>>> performance up but now it is substantially slower and I can't seem
>>>> to figure out what I am doing wrong.
>>>
>>>
>>>
>>>
>>> A few places to start:
>>> 1. VACUUM FULL
>>>    This will make sure any unused space is reclaimed
>>> 2. ANALYZE
>>>    This will recalculate stats for the tables
>>> 3. Basic performce tuning:
>>>    http://www.varlena.com/varlena/GeneralBits/Tidbits/index.php
>>>    There's also a good guide to the postgresql.conf file on varlena.com
>>>
>>>>   Would it be appropriate to ask for help on my program on this
>>>> list? Full disclosure: The program won't be initially GPL'ed because
>>>> it is for my company but it will be released for free to home users
>>>> and the source code will be made available (similar to other
>>>> split-license programs) though once my company makes it's money back
>>>> I think they will fully GPL it (I am on my boss's case about it :p ).
>>>
>>>
>>>
>>>
>>> No problem - what you licence your software under is your concern.
>>> Once you've taken the basic steps described above, try to pick out a
>>> specific query that you think is too slow and provide:
>>>
>>> 1. PostgreSQL version
>>> 2. Basic hardware info (as you have)
>>> 3. Sizes of tables.
>>> 4. Output of EXPLAIN ANALYZE <query here>
>>>
>>> The EXPLAIN ANALYZE runs the query and shows how much work PG thought
>>> it would be and how much it actually turned out to be.
>>>
>>> HTH
>>
>>
>>
>>     Thank you very much!! I am using Psql 7.4 on a stock install of
>> Fedora Core 2 on my IBM thinkpad a22m (P3 1GHz, 512MB RAM, not the
>> fastest HDD). The drive carrier I am using is connected via USB2 and
>> uses a few different hard drives with the fastest being a couple of
>> Barracuda 7200.7 drives (2MB cache, 7,200rpm). I described the program
>> in my reply to Martijn so here is some of the code (code not related
>> to psql snipped, let me know if posting it would help - sorry for the
>> wrapping...):
>
>
> I'm not clear if the database is on the local disk or attached to the
> USB2. Not sure it's important, since neither will be that fast.
>
> If I understand, you scan thousands or millions of files for backup
> purposes and then issue a select + update/insert for each.
>
> Once a partition is scanned, a flag is cleared on all rows.
>
> Once all selected files have been dealt with a vaccum/analyse is issued.
>
>
> Some things to look at:
> 1. How many files are you handling per second? Are the disks involved in
> the backup as well as the database?
> 2. What does the output of "vmstat 10" show when the system is running.
> Is your I/O saturated? CPU?
> 3. Is your main index (file_src_uuid,file_name,file_parent_dir) being
> used? Your best bet is to select from "pg_stat_user_indexes" before and after.
> 4. If you are updating several hundred thousand rows then you probably
> don't have enought vacuum memory set aside - try a vacuum full after
> each set of updates.
> 5. You might want to batch together queries into transactions of a few
> hundred or even few thousand updates.

Hi Richard,

   The database is on the system's local disk and the destination drives
are the USB-connected ones. It is possible to include the server's
local disk in a backup job. Without the DB calls I was able to process
~4000 files in ~2secs.

   'vmstat 10' shows (system running):
r 0; b 0; swapd 3396; free 192356; buff 7084; cache 186508; si 0; so 0;
bi 0; bo 5; in 1023; cs 82; us 1; sy 0; id 99; wa 0

   'vmstat 10' shows (while the program is running):
r 1; b 0; swapd 480; free 180252; buff 8600; cache 195412; si 0; so 0;
bi 6; bo 163; in 1037; cs 281; us 97; sy 3; id 0; wa 0

   I'm afraid that I am pretty new to postgres (and programming, period)
so I am not sure if the indexes are being used. I assumed they were,
simply because I created them, but then again assumptions always get me
in trouble. :p When you say "select from "pg_stat_user_indexes"" do you
mean select those three fields from there instead of 'SELECT <blah> FROM
file_dir'?

   I currently have "VACUUM ANALYZE" set to run after every big set of
UPDATE/INSERTs but that isn't enough? Would it also help if I kept a
count of how many records had been processed and ran "VACUUM
{ANALYZE|FULL}" every 'x' records?

   I have been trying to bunch jobs into a single transaction but for
some reason whenever I try to do that Perl dies on me with an empty
"$DBI::errstr" error (it shows what line it died on but it is blank where
it normally says what went wrong). I was trying to do this like this:

$DB->do("BEGIN TRANSACTION") || die $DBI::errstr;
# Jump into the sub-routine:
&list_files($real_dir, $exclude_list_num, $relative_dir, $file_backup,
$file_restore, $file_display);
$DB->do("COMMIT;") || die $DBI::errstr;

Inside the subroutine I keep a count of how much time has gone by so I can
commit the current transaction every 250 secs and start a new transaction
block:

if ( $rec_num > $rec_disp )
{
    $rec_disp=($rec_disp+500);
    $nowtime=time;
    $processtime=($nowtime - $starttime);
    if ( $processtime >= $committime )
    {
        $committime=$committime+250;
        $DB->do("COMMIT") || die $DBI::errstr;
        $DB->do("BEGIN TRANSACTION") || die $DBI::errstr;
        system 'echo " |- Committed processed records to the database, updating committime to \''.$committime.'\' and proceeding:" >> '.$log;
    }
    system 'echo " |- Processing record number \''.$rec_num.'\', \''.$processtime.'\' seconds into operation." >> '.$log;
}


   Is what I am doing wrong obvious at all? When I enter "BEGIN
TRANSACTION" directly into 'psql' it seems to work fine so I am sure the
syntax is right. Is it a Perl problem maybe?

   Thank you so much for your help!!

Madison

Re: New to the list; would this be an okay question?

From: Richard Huxton
Madison Kelly wrote:
>
>   The database is on the system's local disk and the destination drives
> are on the USB-connected drives. It is possible to include the server's
> local disk in a backup job. Without the DB calls I was able to process
> ~4000 files in ~2secs.
>
>   'vmstat 10' shows (system running):
> r 0; b 0; swapd 3396; free 192356; buff 7084; cache 186508; si 0; so 0;
> bi 0; bo 5; in 1023; cs 82; us 1; sy 0; id 99; wa 0
>
>   'vmstat 10' shows (while the program is running):
> r 1; b 0; swapd 480; free 180252; buff 8600; cache 195412; si 0; so 0;
> bi 6; bo 163; in 1037; cs 281; us 97; sy 3; id 0; wa 0

Hmm - well, your CPU usage is up to 97% userland (us 97) and your disk
activity is up (bi 6; bo 163) as you'd expect.

>   I'm afraid that I am pretty new to postgres (and programming period)
> so I am not sure if the indexes are being used. I assumed they where
> simple because I created them but then again assumptions always get me
> in trouble. :p When you say "select from "pg_stat_indexes"" do you mean
> select from there those three field instead of 'SELECT <blah> FROM
> file_dir'?.

No - there are a set of statistics views called pg_stat_xxx (see ch 23.2
of the online manuals). For example, I have a unique index on
content_core.cc_id:
SELECT relname,indexrelname,idx_scan,idx_tup_read,idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'content_core';
    indexrelname    | idx_scan | idx_tup_read | idx_tup_fetch
-------------------+----------+--------------+---------------
  content_core_pkey |      717 |          717 |           717
(1 row)

SELECT * FROM content_core WHERE cc_id=2;
...

SELECT relname,indexrelname,idx_scan,idx_tup_read,idx_tup_fetch FROM
pg_stat_user_indexes
WHERE relname = 'content_core';
    indexrelname    | idx_scan | idx_tup_read | idx_tup_fetch
-------------------+----------+--------------+---------------
  content_core_pkey |      718 |          718 |           718


As you can see, the index was used for this query.
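
If it's easier to check from inside your script, the same query via DBI
would look roughly like this (a sketch only, using your file_dir table
and your $DB handle):

my $stats = $DB->selectrow_hashref(
    "SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
     FROM pg_stat_user_indexes
     WHERE relname = 'file_dir'");
print "$stats->{indexrelname}: $stats->{idx_scan} index scans so far\n" if $stats;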


>   I currently have "VACUUM ANALYZE" set to run after every big set of
> UPDATE/INSERTs but that isn't enough? Would it also help if I kept a
> count of how many records had been processed and running "VACUUM
> {ANALYZE:FULL}" every so many 'x' records?
>
>   I have been trying to bunch jobs into a single transaction but for
> some reason whenever I try to do that PERL dies on me with an empty
> "$DBI:errstr" error (it shows what line it died on but it is blank where
> it normally says what went wrong). I was trying to do this like this:
>
> $DB->do("BEGIN TRANSACTION") || die $DBI::errstr;
> # Jump into the sub-routine:
> &list_files($real_dir, $exclude_list_num, $relative_dir, $file_backup,
> $file_restore, $file_display);
> $DB->do("COMMIT;") || die $DBI::errstr;

You can do this, but I think better practice is supposed to be:
   $DB->begin_work;
   ...
   $DB->commit;
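
Something along these lines perhaps (untested sketch; the eval/rollback
handling and the batch size are just suggestions, adjust to taste):

   # assumes the handle was connected with AutoCommit => 1 (DBI's default)
   my $batch_size = 1000;
   my $count      = 0;

   $DB->begin_work;                  # AutoCommit off until commit/rollback
   eval {
       foreach my $file (@files) {   # however you iterate your scan results
           # ... your existing SELECT + UPDATE/INSERT calls here ...
           if ( ++$count % $batch_size == 0 ) {
               $DB->commit;          # close this batch
               $DB->begin_work;      # and start the next one
           }
       }
       $DB->commit;                  # commit whatever is left over
   };
   if ($@) {
       warn "Transaction aborted: $@";
       $DB->rollback;
   }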

You might want to turn on statement logging for PostgreSQL's logs -
details in the manuals (Ch 16.4)

> Inside the subroutine I kept a count of how much time went by and to
> commit the current transaction every 250secs and start a new transaction
> block:
>
> if ( $rec_num > $rec_disp )
> {
>     $rec_disp=($rec_disp+500);
>     $nowtime=time;
>     $processtime=($nowtime - $starttime);
>     if ( $processtime >= $committime )
>     {
>         $committime=$committime+250;
>         $DB->do("COMMIT") || die $DBI::errstr;
>         $DB->do("BEGIN TRANSACTION") || die $DBI::errstr;
>         system 'echo " |- Committed processed records to the database,
> updating committime to \''.$committime.'\' and proceeding:" >> '.$log;
>     }
> system 'echo " |- Processing record number \''.$rec_num.'\',
> \''.$processtime.'\' seconds into operation." >> '.$log;
> }
>
>
>   Is what I am doing wrong obvious at all? When I enter "BEGIN
> TRANSACTION" directly into 'psql' it seems to work fine so I am sure the
> syntax is right. Is it a perl prolem maybe?

Nothing leaping out at me - although I'd do something like:
   use IO::File;
   my $log = new IO::File("> logfile.txt");
   print $log "Processing record number $rec_num\n";
   $log->close;
Or even just:
   print STDERR "Processing...";
Note there is just a space between the filehandle and string to print.

Turn logging up in PostgreSQL (which will slow things down of course)
and then get the transaction blocks working, and we'll see what happens
then. Oh, and don't forget the tuning document on varlena.com

--
   Richard Huxton
   Archonet Ltd