Re: [NOVICE] Perl - Postgres - Mailing list pgsql-general
From | codeWarrior |
---|---|
Subject | Re: [NOVICE] Perl - Postgres |
Date | |
Msg-id | b1e3ub$j5e$1@news.hub.org |
In response to | Re: [NOVICE] Perl - Postgres (Richard A Lough <ralough.ced@dnet.co.uk>) |
List | pgsql-general |
DOS / Windows:

    V:\perl\bin> perldoc pg.pm > pgdoc.txt

Then read the text file. Unix:

    $ perldoc pg.pm > pgdoc.txt

Then there's always the CPAN web site's Pg module documentation (I know it
looks bad, but this link works...):

http://ls6-www.cs.uni-dortmund.de/cgi-bin/SFgate?language=English&verbose=1&listenv=DL&application=cpan&convert=CPAN&converthl=&refinequery=application%3dcpan%26convert%3dCPAN%26database%3dwait%252FDB%252Fcpan%26detex%3d1%26formextern%3dCPAN%26multiple%3d1%26range%3d1%26verbose%3d1%26name%3dPg.pm%26synopsis%3d%26text%3d%26author%3d%26tie%3dor%26_lines%3dwhole%2bdocument%26maxhits%3d40%26language%3dEnglish&formintern=&formextern=CPAN&transquery=name%3dpg.pm&_lines=&multiple=0&descriptor=wait%2fDB%2fcpan%7c1000%7c18068%7cDBD::Pg%20-%20PostgreSQL%20database%20driver%20for%20the%20DBI%20module%20%7cCPAN%7cwait%7c%2fservices%2fwww-db%2fWAIT%2fDB%2fcpan%7c301

OR http://www.cpan.org

"Richard A Lough" <ralough.ced@dnet.co.uk> wrote in message
news:3E381F1A.2FC108E9@dnet.co.uk...
>
> Rosta Farzan wrote:
> >
> > Hi
> >
> > Where can I find the list of the functions of the Pg module -- for
> > example, how to insert a row into the database?
> > Does anybody know a tutorial on Perl & Postgres?
> >
> > Thanks
> > Rosta
>
> I have seen some examples. I believe they came with the Pg module.
>
> I attach some GPL code which was intended for MySQL. My hack is
> probably a good example of how _not_ to interface to Postgres,
> but you will need something to start with. The original article
> should still be on the Dr. Dobb's site somewhere (try Aug 2002).
>
> HTH
>
> Richard A Lough
>
> Content-Type: application/x-perl; name="spider.pl"
> Content-Disposition: inline; filename="spider.pl"
>
> #!/usr/bin/perl -w
> #*-------------------------------------------------------------------
> #*- spider.pl:
> #*- Author: Manu Konchady
> #*- License: GPL
> #*- Input: A query
> #*- Output: A list of the hubs and authorities for the query
> #*- Module Dependencies: HTML::Parser 3.0, LWP::UserAgent,
> #*- URI::URL, HTTP::Request, DBI and POSIX.
> #*-
> #*- Description: We assume that two tables have been created for this
> #*- particular query to store URLs and associated links
> #*- prior to the execution of the spider. An entry has also
> #*- been made in a queries table for this query.
> #*-
> #*- An initial collection of URLs is stored in the URL table
> #*- for the query, followed by a parallel execution of spiders
> #*- to selectively generate the web tree for the query.
> #*- After all spiders have terminated, the hubs and authorities
> #*- for the query are listed. The list of all URLs sorted by
> #*- relevancy can also be viewed.
> #*-
> #*- Notes: This code is a sample and should be modified as
> #*- needed. For example, relevancy can be computed using several
> #*- options as described in the article. Links can be pruned or
> #*- followed depending on user criteria. Priorities can be
> #*- set for certain domains. Entity tables can be used to track
> #*- entities across web pages.
> #*-
> #*- RAL hacked to do Postgres instead of MySQL.
> #*- Tables are utab, queries, ltab:
> #*-   utab    ( url varchar(120), status varchar(10), site varchar(120), relevancy float4, level int4, spider int4)
> #*-   queries ( query varchar(250), qword varchar(250), url_limit int4, url_count int4, status varchar(10))
> #*-   ltab    ( url varchar(250), link varchar(250))
> #*- The "query" is the text to match; the qword is a reference name,
> #*- which was used to set the tables up.
> #*-------------------------------------------------------------------
>
> use POSIX ":sys_wait_h";
> use PgDBI;
> use MyWeb;
> use strict;
>
> my $dbname = 'querynet1';
> my ($i, @pid, $done, $num_spiders, $kid, $query, %urls, $count);
> my ($qword, $utab, $ltab, $dbh, $sth, $command, $url_limit, $url);
>
> #*-- The first argument is a table prefix associated with the query.
> #*-- It is used to assign the url and link tables for the query.
> #*-- The database name for the db connection must be entered.
> $qword = $ARGV[0];
> $utab = "$qword" . "_u"; $ltab = "$qword" . "_l";
> $dbh = PgDBI::connect_db("dbname=$dbname");
>
> #*-- get the full query and maximum number of urls for the query
> $command = "select query, url_limit from queries where qword = '$qword'";
> $sth = PgDBI::execute_stmt($dbh, $command);
> ($query, $url_limit) = PgDBI::fetch_row($sth);
>
> #*-- get an initial set of urls from Google for the query
> &call_google();
> #*-- close the database in case google has crashed it
> PgDBI::disconnect_db($dbh, $sth);
>
> #*-- start the parallel operation with 4 spiders; this
> #*-- works on Linux. On a Windows platform, you may need
> #*-- to use the Win32 module to start individual processes
> #$num_spiders = 1;
> $num_spiders = 4;
> for $i (1..$num_spiders)
>  {
>   if ($pid[$i] = fork())   #*-- the parent code continues
>    { next; }
>   else                     #*-- the child code continues
>    { defined($pid[$i]) or die "fork: $!"; &fetch_text($i); }
>  }
>
> ####################################################################
> #*-- wait till all the spiders are complete. Every two seconds,
> #*-- check the status of the spiders
> $done = 0;
> while (!($done))
>  {
>   for $i (1..$num_spiders)
>    {
>     next if ($pid[$i] == 0);
>     $kid = waitpid($pid[$i], &WNOHANG);
>     $pid[$i] = 0 if ($kid == -1);
>    }
>   $done = 1;
>   for $i (1..$num_spiders) { $done = 0 if ($pid[$i] != 0); }
>   sleep(2);
>  }
>
> #*-- clean up after the spiders: find all URLs which were
> #*-- not processed and delete them from the link table
> $dbh = PgDBI::connect_db("dbname=$dbname");
> $command = "select url from $utab where status != 'd'";
> $sth = PgDBI::execute_stmt($dbh, $command);
> while ( ($url) = PgDBI::fetch_row($sth) ) { $urls{$url}++; }
> foreach $url (keys %urls)
>  { $command = "delete from $ltab where url = '$url'";
>    $sth = PgDBI::execute_stmt($dbh, $command);
>  }
>
> #*-- delete the unprocessed urls from the url table
> $command = "delete from $utab where status != 'd'";
> $sth = PgDBI::execute_stmt($dbh, $command);
>
> $command = "update queries set status = 'done' where qword = '$qword'";
> $sth = PgDBI::execute_stmt($dbh, $command);
>
> #*-- print the top 10 hubs and authorities
> print ("List of top 10 Hubs\n"); $i = 1;
> $command = "select count(*), url from $ltab group by url order by 1 desc";
> $sth = PgDBI::execute_stmt($dbh, $command);
> while ( ($count, $url) = PgDBI::fetch_row($sth) )
>  { print ("$i. $url\n"); $i++; last if ($i == 11); }
>
> print ("List of top 10 Authorities\n"); $i = 1;
> $command = "select count(*), link from $ltab group by link order by 1 desc";
> $sth = PgDBI::execute_stmt($dbh, $command);
> while ( ($count, $url) = PgDBI::fetch_row($sth) )
>  { print ("$i". "- "." $url\n"); $i++; last if ($i == 11); }
>
> PgDBI::disconnect_db($dbh, $sth);
>
> exit(0);
>
> #*------------------------------------------------------------
> #*- Each of the spiders runs the following code and then terminates.
> #*- An unprocessed URL is selected and traversed. New links
> #*- are added to the tables. The spider number is passed
> #*- as a parameter.
> #*------------------------------------------------------------
> sub fetch_text()
> {
>  my ($snum) = @_;
>  my ($dbh, $sth, $done, $command, $count, $links, $ref_text, $usite);
>  my ($ref_links, $url, $level, $link, %links, $site, $url_temp);
>
>  #*-- create a handle for the database
>  $dbh = PgDBI::connect_db("dbname=$dbname");
>  $done = 0;
>  while (!($done))
>   {
>    $link = '';
>    ## $command = "lock tables $utab write, queries write"; # no good in postgres
>    # $command = "begin work;";        # ral new
>    # $sth = PgDBI::execute_stmt($dbh, $command);
>    # $command = "lock table $utab";   # ral new
>    # $sth = PgDBI::execute_stmt($dbh, $command);
>    # $command = "lock table queries"; # ral new
>    # $sth = PgDBI::execute_stmt($dbh, $command);
>
>    #*-- check if there are any unprocessed URLs
>    $command = "select url, level, site from $utab where status = ''"; # Pg
>    $sth = PgDBI::execute_stmt($dbh, $command);
>    ($url, $level, $usite) = PgDBI::fetch_row($sth);
>    $command = "select count(*) from $utab where status = ''"; # Pg
>    $sth = PgDBI::execute_stmt($dbh, $command);
>    ($count) = PgDBI::fetch_row($sth);
>
>    if ($count == 0)
>     { $done = 1; next; } # Pg
>
>    #*-- update the number of URLs processed in the queries table
>    $command = "select count(*) from $utab where status = 'd'";
>    $sth = PgDBI::execute_stmt($dbh, $command);
>    ($count) = PgDBI::fetch_row($sth);
>    $command = "update queries set url_count = $count where qword = '$qword'";
>    $sth = PgDBI::execute_stmt($dbh, $command);
>
>    #*-- check if the number of processed URLs exceeds the limit
>    if ($count > $url_limit)
>     # { $done = 1; $sth = PgDBI::execute_stmt($dbh, "unlock tables"); next; } # no good in postgres
>     { $done = 1;
>       # $sth = PgDBI::execute_stmt($dbh, "commit work");
>       next; } # Pg
>
>    #*-- set the status of the URL and unlock the tables
>    $url_temp = $url;
>    $url =~ s/'/\\'/g; $url =~ s/\$/\\\$/;
>    $command = "update $utab set status = 'v' where url = '$url'";
>    $sth = PgDBI::execute_stmt($dbh, $command);
>
>    #*-- if this is an irrelevant URL, then remove it
>    $url = $url_temp;
>    ($ref_text, $ref_links) = MyWeb::parse_URL($url);
>    if (!(&match($$ref_text, $query) ) )
>     {
>      $command = "delete from $utab where url = '$url'";
>      $sth = PgDBI::execute_stmt($dbh, $command);
>      if ($link eq '') { next; }
>      else {
>        $command = "delete from $ltab where url = '$url'";
>        $sth = PgDBI::execute_stmt($dbh, $command);
>        next; }
>     }
>
>    #*-- add all links for this relevant URL to $ltab
>    $level++; %links = %$ref_links;
>    foreach $link (keys %links)
>     {
>      ($site = $link) =~ s#/.*$##i;
>      #*-- skip links to ads and other formatted documents
>      next if ($link =~ /doubleclick/i);
>      next if ($link =~ /affiliates\./i);
>      next if ($link =~ /\.(jpg|gif|pdf|ps|doc|xls|ppt|mpg|mpeg)$/i);
>      #*-- skip links to the same site, except when one of the
>      #*-- query keywords matches the link
>      next if ( ($site eq $usite) && (!(&match_link($link, $query))) );
>      $link =~ s/'/\\'/g; $link =~ s/\$/\\\$/;
>      $link = lc($link);
>      #*-- check for duplicates
>      $command = "select count(*) from $utab where url = '$link'";
>      $sth = PgDBI::execute_stmt($dbh, $command);
>      ($count) = PgDBI::fetch_row($sth);
>      next if ($count != 0);
>      #*-- check for too many URLs from the same site
>      $command = "select count(*) from $utab where site = '$site'";
>      $sth = PgDBI::execute_stmt($dbh, $command);
>      ($count) = PgDBI::fetch_row($sth);
>      next if ($count > 200);
>      #*-- create entries in the link and url tables
>      $command = "insert into $ltab values ('$url', '$link')";
>      $sth = PgDBI::execute_stmt($dbh, $command);
>      $command = "insert into $utab values ('$link', '', '$site', 0.0, $level, $snum)";
>      $sth = PgDBI::execute_stmt($dbh, $command);
>     }
>
>    #*-- set the status and relevancy of the URL; text from
>    #*-- the web page can be used to set relevancy values
>    $command = "update $utab set status = 'd' where url = '$url'";
>    $sth = PgDBI::execute_stmt($dbh, $command);
>    $command = "update $utab set relevancy = 0.2 where url = '$url'";
>    $sth = PgDBI::execute_stmt($dbh, $command);
>   }
>
>  PgDBI::disconnect_db($dbh, $sth);
>  sleep (3);
>  exit(0);
> }
>
> #*------------------------------------------------------
> #*- check if the text satisfies the boolean query
> #*------------------------------------------------------
> sub match()
> {
>  my ($text, $query_terms) = @_;
>  my ($word, @words, $quotes, $retval);
>
>  $quotes = '"\'';
>  #*-- split a query with the AND operator and recursively
>  #*-- parse the query
>  if ($query_terms =~ /\bAND\b/i)
>   {
>    @words = split(/\bAND\b/i, $query_terms);
>    foreach $word (@words)
>     {
>      if ($word =~ /\((.*?)\)/)
>       { $retval = &match($text, $1);
>         if ($retval == 0) { return(0); } else { next; }
>       }
>      $word =~ s/[$quotes]//g; $word =~ s/^\s+//; $word =~ s/\s+$//;
>      return(0) if ($text !~ /\b$word\b/i);
>     }
>    return(1);
>   }
>
>  #*-- split a query with the OR operator and recursively
>  #*-- parse the query
>  if ($query_terms =~ /\bOR\b/i)
>   {
>    @words = split(/\bOR\b/i, $query_terms);
>    foreach $word (@words)
>     {
>      if ($word =~ /\((.*?)\)/)
>       { $retval = &match($text, $1);
>         if ($retval == 1) { return(1); } else { next; }
>       }
>      $word =~ s/[$quotes]//g; $word =~ s/^\s+//; $word =~ s/\s+$//;
>      return(1) if ($text =~ /\b$word\b/i);
>     }
>    return(0);
>   }
>
>  #*-- assume an AND operator for a query with multiple terms
>  $query_terms =~ s/[$quotes]//g; $query_terms =~ s/^\s+//; $query_terms =~ s/\s+$//;
>  $query_terms =~ s/([()])/\\$1/g;
>  @words = split(/\s+/, $query_terms);
>  foreach $word (@words)
>   { return(0) if ($text !~ /\b$word\b/i); }
>  return(1);
> }
>
> #*------------------------------------------------------
> #*- check if any words in the query match the link
> #*------------------------------------------------------
> sub match_link()
> {
>  my ($link, $query) = @_;
>  my (@words, $word, $match);
>
>  $match = 0;
>  @words = split(/\s+/, $query);
>  foreach $word (@words)
>   {
>    next if ( ($word =~ /\bAND\b/i) || ($word =~ /\bOR\b/i) || ($word =~ /\bNOT\b/i) );
>    $match = 1 if ($link =~ /$word/i);
>   }
>  return($match);
> }
>
> #*-------------------------------------------
> #*- load the initial set of URLs from Google
> #*-------------------------------------------
> sub call_google()
> {
>  my ($sql, $query, $url_string, $reg, $j, $script, $ref_html);
>  my ($output, $site, %urls, $url);
>  my ($count) = 0;
>
>  $sth = PgDBI::execute_stmt($dbh, "delete from $utab");
>  $sth = PgDBI::execute_stmt($dbh, "delete from $ltab");
>
>  #*-- get the full query
>  $sql = "select query from queries where qword = '$qword'";
>  $sth = PgDBI::execute_stmt($dbh, "$sql");
>  ($query) = PgDBI::fetch_row($sth);
>
>  #*-- get urls from Google
>  $query =~ s/(\W)/sprintf("%%%x", ord($1))/eg;
>  $url_string = "http://www.google.com/search?num=100&safe=off&btnG=Google+Search";
>  $url_string .= "&q=$query&filter=0&start=";
>
>  $reg = '<p><a href=http://(.*?)>';
>  undef $/;
>  for ($j = 0; $j < 46; $j += 50)
>   {
>    $script = "$url_string$j";
>    ($ref_html) = MyWeb::get_URL($script);
>    $output = $$ref_html;   #*-- get all the data
>    while ($output =~ /$reg/sgi)
>     { ($url = $1) =~ tr/A-Z/a-z/; $urls{$url}++; }
>    sleep(3);
>   }
>
>  #*-- put the urls in a table
>  foreach $url (keys %urls)
>   {
>    $count++;
>    ($site = $url) =~ s#/.*$##i; $url =~ s/'/\\'/g; $url =~ s/\$/\\\$/;
>    $command = "insert into $utab values ('$url', '', '$site', 0.0, 0, 0)";
>    $sth = PgDBI::execute_stmt($dbh, $command);
>   }
> }
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>        http://www.postgresql.org/users-lounge/docs/faq.html
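A note on the hand-rolled quoting in the attached script: it escapes single quotes with a backslash (`$url =~ s/'/\\'/g`), which relies on PostgreSQL's non-standard backslash-escape extension. The SQL-standard escape, always accepted by Postgres, is to double the quote (`''`). A minimal sketch (`sql_quote_literal` is a name I made up for illustration, not from the post; with DBI/DBD::Pg, `?` placeholders avoid manual quoting entirely):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Quote a string for use as a SQL string literal by doubling embedded
# single quotes, the SQL-standard escape. (Hypothetical helper; the
# script's backslash escaping depends on a PostgreSQL extension.)
sub sql_quote_literal {
    my ($s) = @_;
    $s =~ s/'/''/g;              # ' -> ''
    return "'" . $s . "'";
}

my $url = "http://example.com/o'reilly";
print "insert into ltab values (" . sql_quote_literal($url) . ")\n";
# prints: insert into ltab values ('http://example.com/o''reilly')
```

Backslashes and dollar signs then need no treatment at all, since neither is special inside a standard SQL string literal.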
pgsql-general by date:
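The spider-management pattern in the script (fork N children, then poll with waitpid/WNOHANG) can be exercised on its own. One detail worth noting: with WNOHANG, waitpid returns 0 while the child is still running, the child's pid at the moment it is reaped, and -1 only once the child no longer exists, so the script's `$kid == -1` test works but notices each exit one polling cycle late. A self-contained sketch (child count and sleep interval are arbitrary for the demo):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ":sys_wait_h";

my $num_spiders = 3;                        # arbitrary for the demo
my @pid;
for my $i (1 .. $num_spiders) {
    if ($pid[$i] = fork()) { next; }        # parent: remember pid, keep forking
    defined $pid[$i] or die "fork: $!";
    exit 0;                                 # child: do its work, then quit
}

# Poll non-blockingly until every child has been reaped.
my $reaped = 0;
while ($reaped < $num_spiders) {
    for my $i (1 .. $num_spiders) {
        next if $pid[$i] == 0;              # already reaped
        my $kid = waitpid($pid[$i], WNOHANG);
        if ($kid == $pid[$i] || $kid == -1) {   # exited now, or already gone
            $pid[$i] = 0;
            $reaped++;
        }
    }
    sleep 1 if $reaped < $num_spiders;
}
print "reaped $reaped spiders\n";
```

Checking for the returned pid as well as -1 reaps each child on the first poll after it exits, instead of a cycle later.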