Thread: New to the list; would this be an okay question?
Hi all,

  I am new to the list and I don't want to seem rude, so I wanted to ask if this was okay first.

  I have a program I have written in Perl which uses a PostgreSQL database as the backend. The program works, but the performance is really bad. I have been reading as much as I can on optimizing performance, but it still isn't reasonable. At one point I had my program able to process 175,000 records in 16min 10sec on a Pentium3 650MHz, 448MB RAM test machine. Since then I got a Pentium3 1GHz, 512MB system and I have tried a lot of things to get the performance up, but now it is substantially slower and I can't figure out what I am doing wrong.

  Would it be appropriate to ask for help on my program on this list? Full disclosure: the program won't initially be GPL'ed because it is for my company, but it will be released for free to home users and the source code will be made available (similar to other split-license programs), and once my company makes its money back I think they will fully GPL it (I am on my boss's case about it :p ).

Thanks all!

Madison Kelly
Standard questions:
- Have you VACUUMed?
- Have you VACUUM ANALYZEd?
- Have you done EXPLAIN ANALYZE on the complex queries?
- Have you put INDEXes on the appropriate columns?

You need to give more details if you want more detailed answers.

On Mon, Jun 21, 2004 at 09:38:14AM -0400, Madison Kelly wrote:
> Hi all,
>
>   I am new to the list and I didn't want to seem rude at all so I
> wanted to ask if this was okay first.
>
>   I have a program I have written in Perl which uses a PostgreSQL
> database as the backend. The program works but the performance is really
> bad. [...]

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
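A minimal sketch of running those standard maintenance steps from Perl/DBI (the database, user and index names here are placeholders; the real file_dir schema appears further down the thread):

use strict;
use warnings;
use DBI;

my $DB = DBI->connect("DBI:Pg:dbname=your_db", "your_user", "",
                      { RaiseError => 1, AutoCommit => 1 });

$DB->do("VACUUM");          # reclaim space left by dead rows
$DB->do("VACUUM ANALYZE");  # ...and refresh the planner's statistics

# Index the columns the hot queries filter on (run once, if the index does not already exist):
$DB->do("CREATE INDEX file_dir_idx ON file_dir (file_src_uuid, file_name, file_parent_dir)");

$DB->disconnect;

EXPLAIN ANALYZE itself is shown a little further down, once a concrete query is on the table.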
Madison Kelly wrote:
> Hi all,
>
>   I am new to the list and I didn't want to seem rude at all so I wanted
> to ask if this was okay first.

No problem. Reading your message below, you might want to try the performance list, but general is a good place to start.

>   I have a program I have written in Perl which uses a PostgreSQL
> database as the backend. The program works but the performance is really
> bad. [...] Since then I got a Pentium3 1GHz, 512MB system and I have
> tried a lot of things to get the performance up but now it is
> substantially slower and I can't seem to figure out what I am doing wrong.

A few places to start:
1. VACUUM FULL
   This will make sure any unused space is reclaimed
2. ANALYZE
   This will recalculate stats for the tables
3. Basic performance tuning:
   http://www.varlena.com/varlena/GeneralBits/Tidbits/index.php
   There's also a good guide to the postgresql.conf file on varlena.com

>   Would it be appropriate to ask for help on my program on this list?
> Full disclosure: The program won't initially be GPL'ed because it is for
> my company [...]

No problem - what you licence your software under is your concern. Once you've taken the basic steps described above, try to pick out a specific query that you think is too slow and provide:

1. PostgreSQL version
2. Basic hardware info (as you have)
3. Sizes of tables.
4. Output of EXPLAIN ANALYZE <query here>

EXPLAIN ANALYZE runs the query and shows how much work PG thought it would be and how much it actually turned out to be.

HTH
--
Richard Huxton
Archonet Ltd
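A rough sketch of gathering items 1, 3 and 4 from Perl/DBI (connection details and the WHERE values are placeholders):

use strict;
use warnings;
use DBI;

my $DB = DBI->connect("DBI:Pg:dbname=your_db", "your_user", "", { RaiseError => 1 });

# 1. PostgreSQL version
my ($version) = $DB->selectrow_array("SELECT version()");
print "$version\n";

# 3. Table size, in rows
my ($rows) = $DB->selectrow_array("SELECT count(*) FROM file_dir");
print "file_dir: $rows rows\n";

# 4. The plan the planner actually chose for the suspect query
my $plan = $DB->selectall_arrayref(
    "EXPLAIN ANALYZE SELECT null FROM file_dir
      WHERE file_src_uuid='PLACEHOLDER' AND file_parent_dir='/' AND file_name='.'");
print "$_->[0]\n" for @$plan;

$DB->disconnect;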
Sorry; I didn't include details at first because I wanted to make sure this was an appropriate request for this list.

  I have the program run 'VACUUM ANALYZE' after every major update/insert job, and I have in fact indexed the three columns that I search through when I need to decide whether to update (if the record exists) or insert (if it does not). I have read about the EXPLAIN option/tool but I haven't been able to get my head around how to properly use it yet.

  Here is what I am trying to do: The program is a Linux backup program that uses a web front-end as its interface (so that a client can access it from any system, including MS workstations). The program is also designed to allow the user to search for a given file, or for files by spec (file size, date modified, etc.), on media that is offline. The program is designed around externally connected USB2 and firewire drives, so it is all partition-based.

  In order to make the web front-end stateful, and to allow for the ability to search, I need to keep detailed information in the database on every file and directory on a given partition. Some of the information also needs to be maintained, so I can't just clear and rescan a partition. For example, I need to keep track of which directories and files a user has selected or deselected for backup on a given partition.

  This means that whenever I need to update the contents of a partition I run 'ls' starting at the mount point for the partition and scan down through all subdirectories. As each file is scanned I check the database to see if the file name I am looking at already exists. I do this by searching for the file_name, file_parent_dir (parent directory) and file_src_uuid (the UUID [serial number] of the partition the file is on). If there is a match I run an "UPDATE" where the backup state is not touched. If the file is new then it is added to the database along with all of its particular information such as owning user, group, permissions, file size and so on.

  Given that some file systems have 250,000 files and directories, I need to make sure the database calls are as optimized as I can make them. I have verified that the lag is in the database by commenting out the actual database calls and letting the program traverse the file system; a job that took nearly 200 seconds with the database calls in place finishes in roughly 2 seconds without them.

  If this is an okay request I would be happy to post the schema I am using and the Perl code I am using to make the DB calls.

Thanks!!

Madison Kelly

Martijn van Oosterhout wrote:
> Standard questions:
> - Have you VACUUMed?
> - Have you VACUUM ANALYZEd?
> - Have you done EXPLAIN ANALYZE on the complex queries?
> - Have you put INDEXes on the appropriate columns?
>
> You need to give more details if you want more detailed answers.
[...]
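For illustration, a stripped-down sketch of the per-file exists-check described above, using File::Find for the traversal in place of the program's own recursive routine (the connection details and mount point are placeholders, and only the lookup is shown):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use File::Find;
use File::Basename qw(dirname basename);

my $mount_point   = '/mnt/backup_src';    # placeholder
my $file_src_uuid = 'PLACEHOLDER-UUID';   # placeholder

my $DB = DBI->connect("DBI:Pg:dbname=your_db", "your_user", "", { RaiseError => 1 });
my $check = $DB->prepare(
    "SELECT 1 FROM file_dir WHERE file_src_uuid=? AND file_parent_dir=? AND file_name=?");

find(sub {
    my $relative_dir = dirname($File::Find::name);
    $relative_dir =~ s/^\Q$mount_point\E//;    # store paths relative to the mount point
    $relative_dir = '/' if $relative_dir eq '';
    my $file_name = basename($File::Find::name);

    # DBD::Pg returns the number of rows the SELECT produced ("0E0" when none)
    my $rows = $check->execute($file_src_uuid, $relative_dir, $file_name);
    if ($rows > 0) {
        # UPDATE the existing row, leaving the user's backup/restore flags alone
    } else {
        # INSERT a new row with this file's stat() information
    }
}, $mount_point);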
Richard Huxton wrote:
[...]
> No problem - what you licence your software under is your concern. Once
> you've taken the basic steps described above, try to pick out a specific
> query that you think is too slow and provide:
>
> 1. PostgreSQL version
> 2. Basic hardware info (as you have)
> 3. Sizes of tables.
> 4. Output of EXPLAIN ANALYZE <query here>
>
> EXPLAIN ANALYZE runs the query and shows how much work PG thought it
> would be and how much it actually turned out to be.
>
> HTH

  Thank you very much!! I am using PostgreSQL 7.4 on a stock install of Fedora Core 2 on my IBM ThinkPad A22m (P3 1GHz, 512MB RAM, not the fastest HDD). The drive carrier I am using is connected via USB2 and uses a few different hard drives, the fastest being a couple of Barracuda 7200.7 drives (2MB cache, 7,200rpm). I described the program in my reply to Martijn, so here is some of the code (code not related to the database snipped; let me know if posting it would help - sorry for the wrapping...):

=-[ Calling the database ]-=

# Open the connection to the database
my $DB = DBI->connect("DBI:Pg:dbname=$db_name", "$user")
    || die("Connect error (Is PostgreSQL running?): $DBI::errstr");

# Prepare the statements before using them, for speed:
$select_sth = $DB->prepare("SELECT null FROM file_dir WHERE file_src_uuid=? AND file_parent_dir=? AND file_name=?") || die "$DBI::errstr";

$select_up = $DB->prepare("UPDATE file_dir SET file_perm=?, file_own_user=?, file_own_grp=?, file_size=?, file_mod_date=?, file_mod_time=?, file_mod_time_zone=?, file_exist=? WHERE file_src_uuid=? AND file_parent_dir=? AND file_name=?") || die "$DBI::errstr";

$select_in = $DB->prepare("INSERT INTO file_dir ( file_src_uuid, file_name, file_dir, file_parent_dir, file_perm, file_own_user, file_own_grp, file_size, file_mod_date, file_mod_time, file_mod_time_zone, file_backup, file_restore, file_display, file_exist ) VALUES ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )") || die "$DBI::errstr";

# Set the 'file_exist' flag to 'false'; existing files are reset to 'true' as they are found.
$DB->do("UPDATE file_dir SET file_exist='f' WHERE file_src_uuid='$file_src_uuid'") || die "$DBI::errstr";

# Start scanning the drive
$num = $select_sth->execute($file_src_uuid, $relative_dir, $file_name) || die "$DBI::errstr";
if ( $num > 0 )
{
    $select_up->execute($file_perm, $file_own_user, $file_own_grp, $file_size, $file_mod_date, $file_mod_time, $file_mod_time_zone, $file_exist, $file_src_uuid, $file_parent_dir, $file_name) || die "$DBI::errstr";
}
else
{
    $select_in->execute($file_src_uuid, $file_name, $file_dir, $file_parent_dir, $file_perm, $file_own_user, $file_own_grp, $file_size, $file_mod_date, $file_mod_time, $file_mod_time_zone, $file_backup, $file_restore, $file_display, $file_exist) || die "$DBI::errstr";
}

# We need to grab the existing file settings for the special file '/.'
$DBreq = $DB->prepare("SELECT file_backup, file_restore, file_display FROM file_dir WHERE file_parent_dir='/' AND file_name='.' AND file_src_uuid='$file_src_uuid'") || die $DBI::errstr;
$DBreq->execute() || die $DBI::errstr;
@file_backup  = $DBreq->fetchrow_array();
$file_backup  = $file_backup[0];
$file_restore = $file_backup[1];
$file_display = $file_backup[2];

# Jump into the re-entrant subroutine to scan directories and sub-dirs
&list_files($real_dir, $exclude_list_num, $relative_dir, $file_backup, $file_restore, $file_display);

# Inside the subroutine
# Does the directory/file/symlink already exist? (there are three of these, one for each file type)
$num = $select_sth->execute($file_src_uuid, $relative_dir, $file_name) || die "$DBI::errstr";
if ( $num > 0 )
{
    $select_up->execute($file_perm, $file_own_user, $file_own_grp, $file_size, $file_mod_date, $file_mod_time, $file_mod_time_zone, $file_exist, $file_src_uuid, $file_parent_dir, $file_name) || die "$DBI::errstr";
}
else
{
    # The file did not exist, so use the passed parent settings for the 'file_backup' flag and leave the 'file_display' flag set to 'f'
    $select_in->execute($file_src_uuid, $file_name, $file_dir, $file_parent_dir, $file_perm, $file_own_user, $file_own_grp, $file_size, $file_mod_date, $file_mod_time, $file_mod_time_zone, $file_backup, $file_restore, $file_display, $file_exist) || die "$DBI::errstr";
}

# If this was a file, loop and process the next file in the directory; if it was a
# directory, re-enter the subroutine to process its contents, then pick up where
# I left off when it returns.

# Returning from the final subroutine and finishing up
$DB->do("VACUUM ANALYZE");

=-[ finished DB related source code ]-=

  Here is the schema for the 'file_dir' table, which I hit repeatedly here:

=-[ file_dir table and index schemas ]-=

CREATE TABLE file_dir (  -- Used to store info on every file on source partitions
    file_id            serial unique,         -- make this 'bigserial' if there may be more than 2 billion files in the database
    file_src_uuid      varchar(40) not null,  -- the UUID of the source partition hosting the original file
    file_org_uuid      varchar(40),           -- the UUID the file came from (when the file was moved by TLE-BU)
    file_name          varchar(255) not null, -- name of the file or directory
    file_dir           bool not null,         -- 't' = directory, 'f' = file
    file_parent_dir    varchar(255) not null, -- if directory '/foo/bar', parent is '/foo'; if file '/foo/bar/file', parent is '/foo/bar'. The mount directory is treated as '/' so any directories below it will be ignored for this record.
    file_perm          varchar(10) not null,  -- file or directory permissions
    file_own_user      varchar(255) not null, -- the file's owning user (by name, not UID!)
    file_own_grp       varchar(255) not null, -- the file's owning group (by name, not GID!)
    file_size          bigint not null,       -- file size in bytes
    file_mod_date      varchar(12) not null,  -- file's last edited date
    file_mod_time      varchar(20) not null,  -- file's last edited time
    file_mod_time_zone varchar(6) not null,   -- file's last edited time zone
    file_backup        boolean not null default 'f',  -- 't' = include in backup jobs, 'f' = do not include in backup jobs
    file_restore       boolean not null default 'f',  -- 't' = include in restore jobs, 'f' = do not include in restore jobs
    file_display       boolean not null default 'f',  -- 't' = display, 'f' = hide
    file_exist         boolean default 't'    -- Used to catch files deleted since the last scan. Before a rescan all files for a given src_uuid are set to 'f' (deleted); as each file is found or updated it is reset to 't' (exists), and any record still 'f' at the end of the scan is removed.
);
--
CREATE INDEX file_dir_idx ON file_dir (file_src_uuid, file_name, file_parent_dir);

=-[ Finish file_dir table and index schemas ]-=

Thanks so much!!

Madison
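Given that schema, one quick check (a sketch reusing the $DB handle from the code above; the WHERE values are placeholders) is to EXPLAIN ANALYZE the per-file lookup and confirm the planner actually uses file_dir_idx:

# Does the per-file lookup use file_dir_idx, or is it seq-scanning file_dir?
my $plan = $DB->selectall_arrayref(
    "EXPLAIN ANALYZE
     SELECT null FROM file_dir
      WHERE file_src_uuid = 'PLACEHOLDER-UUID'
        AND file_parent_dir = '/'
        AND file_name = '.'");
print "$_->[0]\n" for @$plan;
# A line like 'Index Scan using file_dir_idx on file_dir' means the index is
# being used; 'Seq Scan on file_dir' means it is not.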
Madison Kelly wrote:
[...]
>   Thank you very much!! I am using PostgreSQL 7.4 on a stock install of
> Fedora Core 2 on my IBM ThinkPad A22m (P3 1GHz, 512MB RAM, not the
> fastest HDD). The drive carrier I am using is connected via USB2 and
> uses a few different hard drives, the fastest being a couple of
> Barracuda 7200.7 drives (2MB cache, 7,200rpm). I described the program
> in my reply to Martijn, so here is some of the code [...]

I'm not clear if the database is on the local disk or attached to the USB2. Not sure it's important, since neither will be that fast.

If I understand, you scan thousands or millions of files for backup purposes and then issue a select + update/insert for each. Once a partition is scanned, a flag is cleared on all rows. Once all selected files have been dealt with, a vacuum/analyse is issued.

Some things to look at:
1. How many files are you handling per second? Are the disks involved in the backup as well as the database?
2. What does the output of "vmstat 10" show when the system is running? Is your I/O saturated? CPU?
3. Is your main index (file_src_uuid, file_name, file_parent_dir) being used? Your best bet is to select from "pg_stat_user_indexes" before and after.
4. If you are updating several hundred thousand rows then you probably don't have enough vacuum memory set aside - try a vacuum full after each set of updates.
5. You might want to batch queries together into transactions of a few hundred or even a few thousand updates.

--
Richard Huxton
Archonet Ltd
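A sketch of suggestion 5 - committing in batches of a few thousand statements instead of once per file (connection details are placeholders, and the per-file work is left as a comment):

use strict;
use warnings;
use DBI;

my $DB = DBI->connect("DBI:Pg:dbname=your_db", "your_user", "",
                      { RaiseError => 1, AutoCommit => 1 });

my @files_to_scan;        # filled in by the directory scan
my $batch_size = 2000;
my $in_batch   = 0;

$DB->begin_work;
for my $file (@files_to_scan) {
    # ... the SELECT, then UPDATE-or-INSERT, for this one file goes here ...
    if (++$in_batch >= $batch_size) {
        $DB->commit;          # one fsync for the whole batch
        $DB->begin_work;
        $in_batch = 0;
    }
}
$DB->commit;                  # commit whatever is left over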
After a long battle with technology, dev@archonet.com (Richard Huxton), an earthling, wrote:
> 5. You might want to batch together queries into transactions of a few
> hundred or even few thousand updates.

When this particular application was discussed on a local LUG mailing list, this emerged as one of the factors most likely to be a Big Deal.
--
If this was helpful, <http://svcs.affero.net/rm.php?r=cbbrowne> rate me
http://www3.sympatico.ca/cbbrowne/lsf.html
"A hack is a terrible thing to waste, please give to the implementation
of your choice..." -- GJC
Christopher Browne wrote:
> After a long battle with technology, dev@archonet.com (Richard Huxton), an earthling, wrote:
>
>> 5. You might want to batch together queries into transactions of a few
>> hundred or even few thousand updates.
>
> When this particular application was discussed on a local LUG mailing
> list, this emerged as one of the factors most likely to be a Big Deal.

Yep, except... Madison said a laptop was involved, so I'm guessing it's an IDE drive lying about sync-ing. If fsync is effectively off, that shouldn't have such a huge effect, should it?

--
Richard Huxton
Archonet Ltd
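For what it's worth, the server's own idea of the setting is easy to check from Perl (a sketch; connection details are placeholders) - though, as the next reply points out, whether the drive honours the flush is a separate question:

use strict;
use warnings;
use DBI;

my $DB = DBI->connect("DBI:Pg:dbname=your_db", "your_user", "", { RaiseError => 1 });
my ($fsync) = $DB->selectrow_array("SHOW fsync");
print "fsync = $fsync\n";   # 'on' or 'off'
$DB->disconnect;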
On Mon, Jun 21, 2004 at 08:29:54PM +0100, Richard Huxton wrote:
> Christopher Browne wrote:
> > When this particular application was discussed on a local LUG mailing
> > list, this emerged as one of the factors most likely to be a Big Deal.
>
> Yep, except... Madison said a laptop was involved, so I'm guessing it's
> an IDE drive lying about sync-ing. If fsync is effectively off that
> shouldn't have such a huge effect should it?

The IDE drive lying about syncing is different from fsync being turned off. What the drive thinks doesn't matter until after Postgres has written the WAL, closed the transaction and written the pages out. The fsync will still cause Linux to wait for all the data to be written to the disk, which is still a finite amount of time; the disk buffer is only a few MB. Turning fsync off means Linux will never wait, just buffer in system memory.

Similarly, putting it all in one transaction means that within the transaction there is no waiting, only at transaction commit. With fsync on, being in or not in a transaction can make a really big difference.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
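A rough way to see that difference for yourself - N small UPDATEs committed one at a time versus the same N inside a single transaction (a sketch: the table, column and connection details are placeholders, and the absolute numbers will depend entirely on the disk):

use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

my $DB = DBI->connect("DBI:Pg:dbname=your_db", "your_user", "",
                      { RaiseError => 1, AutoCommit => 1 });
my $sth = $DB->prepare("UPDATE file_dir SET file_exist='t' WHERE file_id=?");
my $n   = 1000;

my $t0 = [gettimeofday];
$sth->execute($_) for 1 .. $n;        # autocommit: one commit (and fsync) per statement
printf "autocommit:      %.2f s\n", tv_interval($t0);

$t0 = [gettimeofday];
$DB->begin_work;
$sth->execute($_) for 1 .. $n;
$DB->commit;                          # one commit for the whole batch
printf "one transaction: %.2f s\n", tv_interval($t0);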
Richard Huxton wrote:
[...]
> Some things to look at:
> 1. How many files are you handling per second? Are the disks involved in
>    the backup as well as the database?
> 2. What does the output of "vmstat 10" show when the system is running?
>    Is your I/O saturated? CPU?
> 3. Is your main index (file_src_uuid, file_name, file_parent_dir) being
>    used? Your best bet is to select from "pg_stat_user_indexes" before
>    and after.
> 4. If you are updating several hundred thousand rows then you probably
>    don't have enough vacuum memory set aside - try a vacuum full after
>    each set of updates.
> 5. You might want to batch queries together into transactions of a few
>    hundred or even a few thousand updates.

Hi Richard,

  The database is on the system's local disk and the destination drives are the USB-connected drives. It is possible to include the server's local disk in a backup job. Without the DB calls I was able to process ~4,000 files in ~2 secs.

'vmstat 10' shows (with the system otherwise idle):
r 0; b 0; swpd 3396; free 192356; buff 7084; cache 186508; si 0; so 0; bi 0; bo 5; in 1023; cs 82; us 1; sy 0; id 99; wa 0

'vmstat 10' shows (while the program is running):
r 1; b 0; swpd 480; free 180252; buff 8600; cache 195412; si 0; so 0; bi 6; bo 163; in 1037; cs 281; us 97; sy 3; id 0; wa 0

  I'm afraid that I am pretty new to Postgres (and programming, period) so I am not sure if the indexes are being used. I assumed they were, simply because I created them, but then again assumptions always get me in trouble. :p When you say "select from pg_stat_user_indexes", do you mean select those three fields from there instead of 'SELECT <blah> FROM file_dir'?

  I currently have "VACUUM ANALYZE" set to run after every big set of UPDATEs/INSERTs, but is that not enough? Would it also help if I kept a count of how many records had been processed and ran "VACUUM {ANALYZE|FULL}" every 'x' records?

  I have been trying to bunch jobs into a single transaction, but for some reason whenever I try to do that Perl dies on me with an empty "$DBI::errstr" error (it shows what line it died on but it is blank where it normally says what went wrong). I was trying to do it like this:

$DB->do("BEGIN TRANSACTION") || die $DBI::errstr;
# Jump into the sub-routine:
&list_files($real_dir, $exclude_list_num, $relative_dir, $file_backup, $file_restore, $file_display);
$DB->do("COMMIT") || die $DBI::errstr;

  Inside the subroutine I kept a count of how much time went by, committing the current transaction and starting a new transaction block every 250 seconds:

if ( $rec_num > $rec_disp )
{
    $rec_disp = ($rec_disp + 500);
    $nowtime = time;
    $processtime = ($nowtime - $starttime);
    if ( $processtime >= $committime )
    {
        $committime = $committime + 250;
        $DB->do("COMMIT") || die $DBI::errstr;
        $DB->do("BEGIN TRANSACTION") || die $DBI::errstr;
        system 'echo " |- Committed processed records to the database, updating committime to \''.$committime.'\' and proceeding:" >> '.$log;
    }
    system 'echo " |- Processing record number \''.$rec_num.'\', \''.$processtime.'\' seconds into operation." >> '.$log;
}

  Is what I am doing wrong obvious at all? When I enter "BEGIN TRANSACTION" directly into 'psql' it seems to work fine, so I am sure the syntax is right. Is it a Perl problem maybe?

Thank you so much for your help!!

Madison
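One way to make that failure visible is to let DBI report errors itself rather than relying only on "|| die $DBI::errstr" (a sketch; connection details are placeholders):

use strict;
use warnings;
use DBI;

my $DB = DBI->connect("DBI:Pg:dbname=your_db", "your_user", "",
                      { RaiseError => 1, PrintError => 1, AutoCommit => 1 })
    or die $DBI::errstr;

# With RaiseError on, any failing statement dies with the driver's own message,
# so an empty error string can't slip through unnoticed.
$DB->begin_work;     # DBI's own transaction start, rather than do("BEGIN TRANSACTION")
# ... the batched UPDATE/INSERT statements go here ...
$DB->commit;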
Madison Kelly wrote:
>   The database is on the system's local disk and the destination drives
> are the USB-connected drives. It is possible to include the server's
> local disk in a backup job. Without the DB calls I was able to process
> ~4,000 files in ~2 secs.
>
> 'vmstat 10' shows (with the system otherwise idle):
> r 0; b 0; swpd 3396; free 192356; buff 7084; cache 186508; si 0; so 0;
> bi 0; bo 5; in 1023; cs 82; us 1; sy 0; id 99; wa 0
>
> 'vmstat 10' shows (while the program is running):
> r 1; b 0; swpd 480; free 180252; buff 8600; cache 195412; si 0; so 0;
> bi 6; bo 163; in 1037; cs 281; us 97; sy 3; id 0; wa 0

Hmm - well, your CPU usage is up to 97% userland (us 97) and your disk activity is up (bi 6; bo 163), as you'd expect.

>   I'm afraid that I am pretty new to Postgres (and programming, period)
> so I am not sure if the indexes are being used. I assumed they were,
> simply because I created them, but then again assumptions always get me
> in trouble. :p When you say "select from pg_stat_user_indexes", do you
> mean select those three fields from there instead of 'SELECT <blah> FROM
> file_dir'?

No - there is a set of statistics views called pg_stat_xxx (see ch 23.2 of the online manuals). For example, I have a unique index on content_core.cc_id:

SELECT relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes WHERE relname = 'content_core';

   indexrelname    | idx_scan | idx_tup_read | idx_tup_fetch
-------------------+----------+--------------+---------------
 content_core_pkey |      717 |          717 |           717
(1 row)

SELECT * FROM content_core WHERE cc_id=2;
...

SELECT relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes WHERE relname = 'content_core';

   indexrelname    | idx_scan | idx_tup_read | idx_tup_fetch
-------------------+----------+--------------+---------------
 content_core_pkey |      718 |          718 |           718

As you can see, the index was used for this query.

>   I currently have "VACUUM ANALYZE" set to run after every big set of
> UPDATEs/INSERTs, but is that not enough? Would it also help if I kept a
> count of how many records had been processed and ran "VACUUM
> {ANALYZE|FULL}" every 'x' records?
>
>   I have been trying to bunch jobs into a single transaction, but for
> some reason whenever I try to do that Perl dies on me with an empty
> "$DBI::errstr" error (it shows what line it died on but it is blank where
> it normally says what went wrong). I was trying to do it like this:
>
> $DB->do("BEGIN TRANSACTION") || die $DBI::errstr;
> # Jump into the sub-routine:
> &list_files($real_dir, $exclude_list_num, $relative_dir, $file_backup,
> $file_restore, $file_display);
> $DB->do("COMMIT") || die $DBI::errstr;

You can do this, but I think better practice is supposed to be:

$DB->begin_work;
...
$DB->commit;

You might want to turn on statement logging for PostgreSQL's logs - details in the manuals (Ch 16.4).

> Inside the subroutine I kept a count of how much time went by, committing
> the current transaction and starting a new transaction block every 250
> seconds:
>
> [...]
>
>   Is what I am doing wrong obvious at all? When I enter "BEGIN
> TRANSACTION" directly into 'psql' it seems to work fine, so I am sure the
> syntax is right. Is it a Perl problem maybe?

Nothing leaping out at me - although I'd do something like:

use IO::File;
my $log = new IO::File("> logfile.txt");
print $log "Processing record number $rec_num\n";
$log->close;

Or even just:

print STDERR "Processing...";

Note there is just a space between the filehandle and the string to print.

Turn logging up in PostgreSQL (which will slow things down, of course), then get the transaction blocks working, and we'll see what happens then. Oh, and don't forget the tuning document on varlena.com.

--
Richard Huxton
Archonet Ltd
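Putting those two suggestions together - begin_work/commit batching plus a normal Perl filehandle for the log instead of shelling out to echo - might look something like this (a sketch; the connection details, log file name and batch size are placeholders):

use strict;
use warnings;
use DBI;
use IO::File;

my $DB = DBI->connect("DBI:Pg:dbname=your_db", "your_user", "",
                      { RaiseError => 1, AutoCommit => 1 });
my $log = IO::File->new(">> tle-bu.log") or die "Can't open log: $!";

my $batch_size = 2000;
my $rec_num    = 0;

$DB->begin_work;
# ... inside the directory-scanning loop, after each file's UPDATE/INSERT:
if ((++$rec_num % $batch_size) == 0) {
    $DB->commit;          # flush this batch in one go
    $DB->begin_work;      # and start the next one
    print $log " |- Committed $rec_num records so far\n";
}
# ... once the scan is finished:
$DB->commit;
$log->close;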