Thread: diff-/patch-functionality for text-type data inside PostgreSQL

diff-/patch-functionality for text-type data inside PostgreSQL

From
"Markus Wollny"
Date:
Hi!

I want to implement a versioning system for text columns of a table
inside a PostgreSQL 8.3 database. As some of the changes to a text field
might be very small compared to the total text size, I'd prefer storing
diffs instead of full previous versions of the text and use a patch-like
function whenever I want to actually roll back to a certain version. I
know that I could probably handle this quite easily in the application
code, but I'd prefer a solution running in the database itself, so
that the application wouldn't have to know anything about storing the
diffs; instead, that process would be handled by an ON UPDATE trigger.

So far I have been playing around with PL/PerlU for diff/patch
functionality, using CPAN modules Text::Diff and Text::Patch, but
haven't been too successful, as there seems to be some issue with this
mechanism if the text data doesn't contain newlines. Just as an
off-topic info, because it's some issue with the CPAN modules, not with
PostgreSQL:

#!/usr/bin/perl
use strict;
use warnings;
use Text::Patch;
use Text::Diff;
# Neither string ends in a newline; this is the case that fails.
my $src  = "foo sdffasd";
my $dst  = "34asd sdf";
my $diff = diff( \$src, \$dst, { STYLE => 'Unified' } );
print $diff . "\n";
my $out  = patch( $src, $diff, { STYLE => 'Unified' } );
print "Patch successful\n" if $out eq $dst;

Running this results in the following output:
@@ -1 +1 @@
-foo sdffasd+34asd sdf
Hunk #1 failed at line 1.

Anyway, has anybody already done something in this direction,
preferably in some way that is purely pl/* and wouldn't require any
custom-made C-library? So far I have only found this interesting
description of the implementation of the very same functionality here:
http://www.ciselant.de/projects/pg_ci_diff/doc.html - but there's no
source code supplied for the libpg_ci_diff.so library.

Kind regards

   Markus


Computec Media AG
Registered office and court of registration: Fürth (HRB 8818)
Members of the executive board: Albrecht Hengstenberg (Chairman) and Rainer Rosenbusch
Chairman of the supervisory board: Jürg Marquard
VAT identification number: DE 812 575 276



Re: diff-/patch-functionality for text-type data inside PostgreSQL

From
Tom Lane
Date:
"Markus Wollny" <Markus.Wollny@computec.de> writes:
> So far I have been playing around with PL/PerlU for diff/patch
> functionality, using CPAN modules Text::Diff and Text::Patch, but
> haven't been too successful, as there seems to be some issue with this
> mechanism if the text data doesn't contain newlines.

Almost all diff/patch functions operate line-by-line, so that hardly
seems surprising.

            regards, tom lane

Re: diff-/patch-functionality for text-type data inside PostgreSQL

From
Martijn van Oosterhout
Date:
On Mon, May 04, 2009 at 12:26:13PM +0200, Markus Wollny wrote:
> So far I have been playing around with PL/PerlU for diff/patch
> functionality, using CPAN modules Text::Diff and Text::Patch, but
> haven't been too successful, as there seems to be some issue with this
> mechanism if the text data doesn't contain newlines. Just as an
> off-topic info, because it's some issue with the CPAN modules, not with
> PostgreSQL:

I've used the Algorithm::Diff module in the past with success. It works
on sequences of objects rather than just text but it works well. That
means you can diff on word or character level at your choice, and even
control what sequences you consider "equal". That said, it doesn't have
a patch function but that should be fairly easy to make. You'll need to
define your own storage format for the diff though.

http://search.cpan.org/~nedkonz/Algorithm-Diff-1.15/lib/Algorithm/Diff.pm
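
For illustration, here is a minimal sketch of that idea in plain Perl. It
is not Algorithm::Diff: to avoid any CPAN dependency it builds a naive
character-level LCS table, which is quadratic in the text length. The
names make_delta and apply_delta, and the [offset, delete_count,
insert_string] hunk format, are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: compute a delta that turns $old into $new.
# The delta is a list of hunks [offset_in_old, delete_count, insert_string].
sub make_delta {
    my ($old, $new) = @_;
    my @a = split //, $old;
    my @b = split //, $new;

    # $lcs[$i][$j] is the length of the longest common subsequence
    # of the suffixes @a[$i..] and @b[$j..].
    my @lcs = map { [ (0) x (@b + 1) ] } 0 .. @a;
    for my $i (reverse 0 .. $#a) {
        for my $j (reverse 0 .. $#b) {
            if ($a[$i] eq $b[$j]) {
                $lcs[$i][$j] = $lcs[$i + 1][$j + 1] + 1;
            } else {
                my ($down, $right) = ($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
                $lcs[$i][$j] = $down >= $right ? $down : $right;
            }
        }
    }

    # Walk the table; consecutive deletions and insertions form one hunk.
    my @delta;
    my ($i, $j) = (0, 0);
    my $hunk;    # hunk currently being built, if any
    while ($i < @a || $j < @b) {
        if ($i < @a && $j < @b && $a[$i] eq $b[$j]) {
            $hunk = undef;    # a matching character closes the hunk
            $i++;
            $j++;
            next;
        }
        if (!$hunk) {
            $hunk = [ $i, 0, '' ];
            push @delta, $hunk;
        }
        if ($j >= @b || ($i < @a && $lcs[$i + 1][$j] >= $lcs[$i][$j + 1])) {
            $hunk->[1]++;            # drop one character of the old text
            $i++;
        } else {
            $hunk->[2] .= $b[$j];    # take one character of the new text
            $j++;
        }
    }
    return \@delta;
}

# Hypothetical helper: apply a delta produced by make_delta to $old.
sub apply_delta {
    my ($old, $delta) = @_;
    my $text = $old;
    # Apply hunks back to front so earlier offsets stay valid.
    for my $hunk (reverse @$delta) {
        my ($offset, $deleted, $inserted) = @$hunk;
        substr($text, $offset, $deleted, $inserted);
    }
    return $text;
}

# The two strings from the original report, neither ending in a newline.
my $src = "foo sdffasd";
my $dst = "34asd sdf";

# Diffing from the new text to the old one gives a reverse delta,
# which is what a rollback to the previous version needs.
my $rollback = make_delta($dst, $src);
print "Rollback successful\n" if apply_delta($dst, $rollback) eq $src;
```

Because the delta works on characters rather than lines, a missing
trailing newline makes no difference here. For long texts the quadratic
table gets expensive, so Algorithm::Diff is likely the better basis; the
hunk list would also still need to be serialized before it can be stored
in a column.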

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


Re: diff-/patch-functionality for text-type data inside PostgreSQL

From
"Markus Wollny"
Date:
Hi!

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Monday, 4 May 2009 15:04

> "Markus Wollny" <Markus.Wollny@computec.de> writes:
> > So far I have been playing around with PL/PerlU for diff/patch
> > functionality, using CPAN modules Text::Diff and Text::Patch, but
> > haven't been too successful, as there seems to be some issue with this
> > mechanism if the text data doesn't contain newlines.
>
> Almost all diff/patch functions operate line-by-line, so that
> hardly seems surprising.

Not so much surprising, no, but I hadn't expected it to fail altogether on entries that just end after one line of text
just because they lack a newline character - they are one-line texts after all, so I assumed that the diff would
produce a "replace this old line with the new one" type of instruction instead of producing something that patch doesn't
seem to be able to process at all.

Kind regards

   Markus





Re: diff-/patch-functionality for text-type data inside PostgreSQL

From
"Markus Wollny"
Date:
Hi!

> -----Original Message-----
> From: Martijn van Oosterhout [mailto:kleptog@svana.org]
> Sent: Monday, 4 May 2009 15:30

> I've used the Algorithm::Diff module in the past with
> success. It works on sequences of objects rather than just
> text but it works well. That means you can diff on word or
> character level at your choice, and even control what
> sequences you consider "equal". That said, it doesn't have a
> patch function but that should be fairly easy to make. You'll
> need to define your own storage format for the diff though.
>
> http://search.cpan.org/~nedkonz/Algorithm-Diff-1.15/lib/Algorithm/Diff.pm

Thank you - I have considered using Algorithm::Diff before, as Text::Diff seems to be based on that and one could, as
you describe, create even more granular deltas than just line by line comparisons. The latter would however be fully
sufficient for my needs, but I think I'll give Algorithm::Diff a closer look now :)

Kind regards

   Markus

