Re: DBMirror.pl performance change - Mailing list pgsql-general
From | Peter Wilson |
---|---|
Subject | Re: DBMirror.pl performance change |
Date | |
Msg-id | dr2cun$10r5$1@news.hub.org Whole thread Raw |
In response to | DBMirror.pl performance change (Achilleus Mantzios <achill@matrix.gatewaynet.com>) |
List | pgsql-general |
The Whitebeam implementation of DBMirror.pl : http://www.whitebeam.org/library/guide/TechNotes/replicate.rhtm is a complete re-write in 'C' which avoids a lot of the text processing, and what text processing is required is done usinga state machine rather than repeated regular expressions. Before I wrote the 'C' implementation I did look at optimising the Perl version. One ofmy big concerns was the time taking escaping and re-escaping the strings. I can't remember the details now but as far as I can remember a lot ofthat is unnecessary. There seemed to be an unescape of the data then a re-escape to the target database. In practice the data was in the correct format. We make quite heavy use of both BYTEA and large varchar fields in our database. I did some load testing at the time and foundthe new version could replicate 10s of file objects per second - where the Perl version took 10 minutes to replicate a 120K BYTEA field (both ona slowish machine, but the Perl version wasn't much better on a fast machine *and* took 97% CPU). I also took the opportunity to make the 'C' version much more tolerant to lost DB connections without having to restart andadded a few other tweaks to make it more flexible. It's released under the BSD license now as well Pete -- http://www.whitebeam.org http://www.yellowhawk.co.uk -------- Achilleus Mantzios wrote: > I discovered a problem in DBMirror.pl, performance wise. > > pending.c stores data in a way > very similar to the PgSQL input "\" escaped format. > > When the field is of type bytea, and the source of data is binary, then > this produces 2 additional backslashes for every unprintable > char. > > The performance in function extractData in DBMirror.pl, really suffers > from this condition, since it breaks data in chunks of "\" delimited > strings. > > Informally speaking, performance tends to be O(n) where n is the size > of the data. > > This can be remedied if we break data in chunks of "'" rather than "\". > "'" happens much more infrequently in common binary files (bz2, tiff, jpg, > pdf etc..), and if we notice that odd number of contained "\", signals an > intermidiate "'", whereas even number of "\" signals the final "'", > then we can make this routine run much faster. > > I attach the new extractData function. > > Now replicating a 400 k tiff takes 3 seconds instead of 12 minutes > it used to do. > > I am wondering about the state of > http://www.whitebeam.org/library/guide/TechNotes/replicate.rhtm > > Please feel free for any comments. > > Pete could you test this new DBMirror.pl, to see how it behaves > in comparison with your C++ solution? > > > > ------------------------------------------------------------------------ > > sub extractData($$) { > my $pendingResult = $_[0]; > my $currentTuple = $_[1]; > my $fnumber; > my %valuesHash; > $fnumber = 4; > my $dataField = $pendingResult->getvalue($currentTuple,$fnumber); > my $numofbs; > > while(length($dataField)>0) { > # Extract the field name that is surronded by double quotes > $dataField =~ m/(\".*?\")/s; > my $fieldName = $1; > $dataField = substr $dataField ,length($fieldName); > $fieldName =~ s/\"//g; #Remove the surronding " signs. > > if($dataField =~ m/(^= )/s) { > #Matched null > $dataField = substr $dataField , length($1); > $valuesHash{$fieldName}=undef; > } > elsif ($dataField =~ m/(^=\')/s) { > #Has data. > my $value; > $dataField = substr $dataField ,2; #Skip the =' > LOOP: { #This is to allow us to use last from a do loop. > #Recommended in perlsyn manpage. > do { > my $matchString; > my $matchString2; > #Find the substring ending with the first ' or first \ > $dataField =~ m/(.*?[\'])?/s; > $matchString = $1; > > $numofbs = ($matchString =~ tr/\\//) % 2; > > if ($numofbs == 1) { #// odd number of \, i.e. intermediate ' > $matchString2 = substr $matchString,0, length($matchString)-2; > $matchString2 =~ s/\\\\/\\/g; > $value .= ($matchString2 . "\'"); > $dataField = substr $dataField,length($matchString); > } > else { #// even number of \, i.e. found end of data > $matchString2 = substr $matchString,0, length($matchString)-1; > $matchString2 =~ s/\\\\/\\/g; > $value .= $matchString2; > $dataField = substr $dataField,length($matchString)+1; > last; > } > > > } until(length($dataField)==0); > } > $valuesHash{$fieldName} = $value; > > > }#else if > else { > > logErrorMessage "Error in PendingData Sequence Id " . > $pendingResult->getvalue($currentTuple,0); > die; > } > > > > } #while > return %valuesHash; > > } > > > ------------------------------------------------------------------------ > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match
pgsql-general by date: