Re: DBMirror.pl performance change - Mailing list pgsql-general

From Peter Wilson
Subject Re: DBMirror.pl performance change
Date
Msg-id dr2cun$10r5$1@news.hub.org
Whole thread Raw
In response to DBMirror.pl performance change  (Achilleus Mantzios <achill@matrix.gatewaynet.com>)
List pgsql-general
The Whitebeam implementation of DBMirror.pl :
    http://www.whitebeam.org/library/guide/TechNotes/replicate.rhtm

is a complete re-write in 'C' which avoids a lot of the text processing, and what text processing is required is done
usinga state machine rather  
than repeated regular expressions. Before I wrote the 'C' implementation I did look at optimising the Perl version. One
ofmy big concerns was the  
time taking escaping and re-escaping the strings. I can't remember the details now but as far as I can remember a lot
ofthat is unnecessary. There  
seemed to be an unescape of the data then a re-escape to the target database. In practice the data was in the correct
format.

We make quite heavy use of both BYTEA and large varchar fields in our database. I did some load testing at the time and
foundthe new version could  
replicate 10s of file objects per second - where the Perl version took 10 minutes to replicate a 120K BYTEA field (both
ona slowish machine, but the  
Perl version wasn't much better on a fast machine *and* took 97% CPU).

I also took the opportunity to make the 'C' version much more tolerant to lost DB connections without having to restart
andadded a few other tweaks  
to make it more flexible.

It's released under the BSD license now as well

Pete
--
http://www.whitebeam.org
http://www.yellowhawk.co.uk
--------

Achilleus Mantzios wrote:
> I discovered a problem in DBMirror.pl, performance wise.
>
> pending.c stores data in a way
> very similar to the PgSQL input "\" escaped format.
>
> When the field is of type bytea, and the source of data is binary, then
> this produces 2 additional backslashes for every unprintable
> char.
>
> The performance in function extractData in DBMirror.pl, really suffers
> from this condition, since it breaks data in chunks of "\" delimited
> strings.
>
> Informally speaking, performance tends to be O(n) where n is the size
> of the data.
>
> This can be remedied if we break data in chunks of "'" rather than "\".
> "'" happens much more infrequently in common binary files (bz2, tiff, jpg,
> pdf etc..), and if we notice that odd number of contained "\", signals an
> intermidiate "'", whereas even number of "\" signals the final "'",
> then we can make this routine run much faster.
>
> I attach the new extractData function.
>
> Now replicating a 400 k tiff takes 3 seconds instead of 12 minutes
> it used to do.
>
> I am wondering about the state of
> http://www.whitebeam.org/library/guide/TechNotes/replicate.rhtm
>
> Please feel free for any comments.
>
> Pete could you test this new DBMirror.pl, to see how it behaves
> in comparison with your C++ solution?
>
>
>
> ------------------------------------------------------------------------
>
> sub extractData($$) {
>   my $pendingResult = $_[0];
>   my $currentTuple = $_[1];
>   my $fnumber;
>   my %valuesHash;
>   $fnumber = 4;
>   my $dataField = $pendingResult->getvalue($currentTuple,$fnumber);
>   my $numofbs;
>
>   while(length($dataField)>0) {
>     # Extract the field name that is surronded by double quotes
>     $dataField =~ m/(\".*?\")/s;
>     my $fieldName = $1;
>     $dataField = substr $dataField ,length($fieldName);
>     $fieldName =~ s/\"//g; #Remove the surronding " signs.
>
>     if($dataField =~ m/(^= )/s) {
>       #Matched null
>     $dataField = substr $dataField , length($1);
>       $valuesHash{$fieldName}=undef;
>     }
>     elsif ($dataField =~ m/(^=\')/s) {
>       #Has data.
>       my $value;
>       $dataField = substr $dataField ,2; #Skip the ='
>     LOOP: {  #This is to allow us to use last from a do loop.
>          #Recommended in perlsyn manpage.
>       do {
>     my $matchString;
>     my $matchString2;
>     #Find the substring ending with the first ' or first \
>     $dataField =~ m/(.*?[\'])?/s;
>     $matchString = $1;
>
>     $numofbs = ($matchString =~ tr/\\//) % 2;
>
>     if ($numofbs == 1) { #// odd number of \, i.e. intermediate '
>         $matchString2 = substr $matchString,0, length($matchString)-2;
>         $matchString2 =~ s/\\\\/\\/g;
>         $value .= ($matchString2 . "\'");
>         $dataField = substr $dataField,length($matchString);
>     }
>     else { #// even number of \, i.e. found end of data
>         $matchString2 = substr $matchString,0, length($matchString)-1;
>         $matchString2 =~ s/\\\\/\\/g;
>         $value .= $matchString2;
>         $dataField = substr $dataField,length($matchString)+1;
>         last;
>     }
>
>
>       } until(length($dataField)==0);
>   }
>       $valuesHash{$fieldName} = $value;
>
>
>       }#else if
>       else {
>
>         logErrorMessage "Error in PendingData Sequence Id " .
>         $pendingResult->getvalue($currentTuple,0);
>         die;
>       }
>
>
>
>   } #while
>   return %valuesHash;
>
> }
>
>
> ------------------------------------------------------------------------
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

pgsql-general by date:

Previous
From: Janning Vygen
Date:
Subject: Postgresql/DBA/Sysadmin Consultant in Düsseldorf,Germany
Next
From: "A. Kretschmer"
Date:
Subject: Re: Postgresql/