Thread: md5 of table

md5 of table

From
Sim Zacks
Date:
Is there a way to get an md5 or other hash of an entire table?

I want to be able to easily compare 2 tables in different databases.

I thought about using dblink and the EXCEPT query, but then I need to
know the field list of each query result, which is a pain in the butt.

If I could return an md5 of the entire table, then I could check if the
tables have the same hash and be confident enough that the tables were
identical.


Thanks
Sim


Re: md5 of table

From
Karsten Hilbert
Date:
On Thu, Sep 01, 2011 at 11:47:24AM +0300, Sim Zacks wrote:

> Is there a way to get an md5 or other hash of an entire table?
>
> I want to be able to easily compare 2 tables in different databases.
>
> I thought about using dblink and the EXCEPT query, but then I need to
> know the field list of each query result, which is a pain in the
> butt.
>
> If I could return an md5 of the entire table, then I could check if
> the tables have the same hash and be confident enough that the tables
> were identical.

One option might be to pg_dump in an appropriate format and
md5-compare the output ?

Another option might be to

- cross-check columns/column types
    - query from information_schema
- compare row counts
    - may need a lock
- compare table sizes
    - may need vaccum ?

If all three match that may be good enough ?

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: md5 of table

From
Achilleas Mantzios
Date:
md5 has size limitations, the second approach seems more practical.

Στις Thursday 01 September 2011 12:30:45 ο/η Karsten Hilbert έγραψε:
> On Thu, Sep 01, 2011 at 11:47:24AM +0300, Sim Zacks wrote:
>
> > Is there a way to get an md5 or other hash of an entire table?
> >
> > I want to be able to easily compare 2 tables in different databases.
> >
> > I thought about using dblink and the EXCEPT query, but then I need to
> > know the field list of each query result, which is a pain in the
> > butt.
> >
> > If I could return an md5 of the entire table, then I could check if
> > the tables have the same hash and be confident enough that the tables
> > were identical.
>
> One option might be to pg_dump in an appropriate format and
> md5-compare the output ?
>
> Another option might be to
>
> - cross-check columns/column types
>     - query from information_schema
> - compare row counts
>     - may need a lock
> - compare table sizes
>     - may need vaccum ?
>
> If all three match that may be good enough ?
>
> Karsten
> --
> GPG key ID E4071346 @ gpg-keyserver.de
> E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346
>



--
Achilleas Mantzios

Re: md5 of table

From
Sim Zacks
Date:
On 09/01/2011 12:26 PM, Pavel Stehule wrote:
> Hello
>
> postgres=# create table tt(a int, b varchar);
> CREATE TABLE
> postgres=# insert into tt values(10,'hello');
> INSERT 0 1
>
> postgres=# select md5(array_to_string(array_agg(md5(tt::text)),'')) from tt;
>                 md5
> ----------------------------------
>   20a92a676f52699e613da1bb114bd6f0
> (1 row)
>
> Regards
>
> Pavel Stehule
Would be perfect, but 8.2 can't cast a UDT to text.
ERROR:  cannot cast type tt to text
LINE 1: select tt::text from tt

Thanks
Sim

Re: md5 of table

From
Grzegorz Jaśkiewicz
Date:
On Thu, Sep 1, 2011 at 11:14 AM, Sim Zacks <sim@compulab.co.il> wrote:
> On 09/01/2011 12:26 PM, Pavel Stehule wrote:
>>
>> Hello
>>
>> postgres=# create table tt(a int, b varchar);
>> CREATE TABLE
>> postgres=# insert into tt values(10,'hello');
>> INSERT 0 1
>>
>> postgres=# select md5(array_to_string(array_agg(md5(tt::text)),'')) from
I do that as well, but it might have questionable performance when
your table has 16M rows, and is 50GB +


--
GJ

Re: md5 of table

From
Sim Zacks
Date:
body p { margin-bottom: 0cm; margin-top: 0pt; }

  <body style="direction: ltr;"
    bidimailui-detected-decoding-type="UTF-8" bgcolor="#ffffff"
    text="#000000">

    <blockquote
cite="mid:CAFj8pRCf0tsUQs307Xoj1SiPjbdm8yjhKQkn=X=bQGUAUUMpGA@mail.gmail.com"
      type="cite">

I am not sure if this will work, but you can try it

http://www.pgsql.cz/index.php/PostgreSQL_SQL_Tricks#Cast_to_varchar

Pavel



    I appreciate your help, but UDTs don't have input/ouput functions
    unless you define them manually and I need this for all of my
    tables.

    Thanks
    Sim

Re: md5 of table

From
Sim Zacks
Date:
<meta content=3D"text/html; charset=3DUTF-8" http-equiv=3D"Content-Type=
">
    body p { margin-bottom: 0cm; margin-top: 0pt; }

  <body style=3D"direction: ltr;"
    bidimailui-detected-decoding-type=3D"UTF-8" bgcolor=3D"#ffffff"
    text=3D"#000000">
    On 09/01/2011 01:35 PM, Grzegorz Ja=C5=9Bkiewicz wrote:
    <blockquote
cite=3D"mid:CAJY59_h6TqWfh3NKxfDCq1Pb9hBcYGBqxb5h+czWcgLj4ETeGQ@mail.gmail.=
com"
      type=3D"cite">
      On Thu, Sep 1, 2011 at 11:14 AM, Sim Zacks <a class=3D=
"moz-txt-link-rfc2396E" href=3D"mailto:sim@compulab.co.il"><sim@compulab=
.co.il> wrote:


        On 09/01/2011 12:26 PM, Pavel Stehule wrote:



Hello

postgres=3D# create table tt(a int, b varchar);
CREATE TABLE
postgres=3D# insert into tt values(10,'hello');
INSERT 0 1

postgres=3D# select md5(array_to_string(array_agg(md5(tt::text)),'')) from



      I do that as well, but it might have questionable perf=
ormance when
your table has 16M rows, and is 50GB +


    I don't need performance. This is for regression testing for a
      new database version. I want to run my functions in the old db and
      the new db and when it modifies a table, I want to be able to
      check that the tables are the same.
    Sim

Re: md5 of table

From
Scott Marlowe
Date:
On Thu, Sep 1, 2011 at 3:48 AM, Achilleas Mantzios
<achill@matrix.gatewaynet.com> wrote:
> md5 has size limitations, the second approach seems more practical.

Really?  I was not aware of size limits of md5, what are they?

Re: md5 of table

From
Achilleas Mantzios
Date:
Στις Thursday 01 September 2011 15:50:21 ο/η Scott Marlowe έγραψε:
> On Thu, Sep 1, 2011 at 3:48 AM, Achilleas Mantzios
> <achill@matrix.gatewaynet.com> wrote:
> > md5 has size limitations, the second approach seems more practical.
>
> Really?  I was not aware of size limits of md5, what are they?
>

sorry, i was wrong. i dont know why i had this impression,
just checked with a 43GB table on a freebsd machine and went fine.

--
Achilleas Mantzios

Re: md5 of table

From
Merlin Moncure
Date:
2011/9/1 Grzegorz Jaśkiewicz <gryzman@gmail.com>:
> On Thu, Sep 1, 2011 at 11:14 AM, Sim Zacks <sim@compulab.co.il> wrote:
>> On 09/01/2011 12:26 PM, Pavel Stehule wrote:
>>>
>>> Hello
>>>
>>> postgres=# create table tt(a int, b varchar);
>>> CREATE TABLE
>>> postgres=# insert into tt values(10,'hello');
>>> INSERT 0 1
>>>
>>> postgres=# select md5(array_to_string(array_agg(md5(tt::text)),'')) from
> I do that as well, but it might have questionable performance when
> your table has 16M rows, and is 50GB +

you need order by for that to work.  I would do it like this:
select md5(array(select foo from foo order by foo_pkey)::text);

it's great quick'n'dirty, but not much scalable beyond millions.

OP:
> I thought about using dblink and the EXCEPT query, but then I need to know the field list of each query result, which
isa pain in the butt. 

That is not correct.  As long as the table definitions are precisely
the same, you can move records across dblink without specifying
fields.  You do this by using record type for the composite which
dblink sends across as text.

merlin

Re: md5 of table

From
Sim Zacks
Date:
> OP:
>> I thought about using dblink and the EXCEPT query, but then I need to know the field list of each query result,
whichis a pain in the butt. 
> That is not correct.  As long as the table definitions are precisely
> the same, you can move records across dblink without specifying
> fields.  You do this by using record type for the composite which
> dblink sends across as text.
>
> merlin

Do you have a quick example? This is what I have tried:

select * from tbla
except
select * from dblink('host=dbhost dbname=otherdb user=myuser
password=mypwd'::text, 'select * from tbla')

The error I get back is:
ERROR:  a column definition list is required for functions returning
"record"

Sim


Re: md5 of table

From
Robert Treat
Date:
2011/9/1 Merlin Moncure <mmoncure@gmail.com>:
> 2011/9/1 Grzegorz Jaśkiewicz <gryzman@gmail.com>:
>> On Thu, Sep 1, 2011 at 11:14 AM, Sim Zacks <sim@compulab.co.il> wrote:
>>> On 09/01/2011 12:26 PM, Pavel Stehule wrote:
>>>>
>>>> Hello
>>>>
>>>> postgres=# create table tt(a int, b varchar);
>>>> CREATE TABLE
>>>> postgres=# insert into tt values(10,'hello');
>>>> INSERT 0 1
>>>>
>>>> postgres=# select md5(array_to_string(array_agg(md5(tt::text)),'')) from
>> I do that as well, but it might have questionable performance when
>> your table has 16M rows, and is 50GB +
>
> you need order by for that to work.  I would do it like this:
> select md5(array(select foo from foo order by foo_pkey)::text);
>
> it's great quick'n'dirty, but not much scalable beyond millions.

I've always liked doing this with my pager:

 [robert@client-168] export PAGER=md5
-=[11:40:25 Thu Sep 01]=---=[ pagila-0.10.1 ]=-
 [robert@client-168] psql -hlocalhost -dpagila
psql (9.0.4, server 9.1beta3)
WARNING: psql version 9.0, server version 9.1.
         Some psql features might not work.
Type "help" for help.

pagila=# select * from actor order by actor_id;
f381ebdefe0aada9c0bc14e657962c1f


Robert Treat
conjecture: xzilla.net
consulting: omniti.com

Re: md5 of table

From
Merlin Moncure
Date:
2011/9/1 Sim Zacks <sim@compulab.co.il>:
>
>> OP:
>>>
>>> I thought about using dblink and the EXCEPT query, but then I need to
>>> know the field list of each query result, which is a pain in the butt.
>>
>> That is not correct.  As long as the table definitions are precisely
>> the same, you can move records across dblink without specifying
>> fields.  You do this by using record type for the composite which
>> dblink sends across as text.
>>
>> merlin
>
> Do you have a quick example? This is what I have tried:
>
> select * from tbla
> except
> select * from dblink('host=dbhost dbname=otherdb user=myuser
> password=mypwd'::text, 'select * from tbla')
>
> The error I get back is:
> ERROR:  a column definition list is required for functions returning
> "record"

sure:
select tbla from tbla
except
select t::tbla from dblink('host=dbhost dbname=otherdb user=myuser
password=mypwd'::text, 'select tbla::text from tbla') R(t text);

there's a bunch of ways to do that -- you can also do the md5 on the
remote side so you can just send the digests.
select * from tbla
except
select (t::tbla).* from dblink('host=dbhost dbname=otherdb user=myuser
password=mypwd'::text, 'select tbla::text from tbla') R(t text);

should also work.

This *might* work -- I didn't try.  It's been a while since I've used
stock dblink.
select * from tbla
except
select (t).* from dblink('host=dbhost dbname=otherdb user=myuser
password=mypwd'::text, 'select tbla from tbla') R(t tbla);

merlin

Re: md5 of table

From
Scott Marlowe
Date:
On Thu, Sep 1, 2011 at 7:56 AM, Achilleas Mantzios
<achill@matrix.gatewaynet.com> wrote:
> Στις Thursday 01 September 2011 15:50:21 ο/η Scott Marlowe έγραψε:
>> Really?  I was not aware of size limits of md5, what are they?
>>
>
> sorry, i was wrong. i dont know why i had this impression,
> just checked with a 43GB table on a freebsd machine and went fine.

Well, it might have been an older version or a 32 bit version or
something you were thinking of.  Sometimes a long memory helps us get
the wrong answer. :)

Re: md5 of table

From
Vincent de Phily
Date:
On Thursday 01 September 2011 11:47:24 Sim Zacks wrote:
> Is there a way to get an md5 or other hash of an entire table?
>
> I want to be able to easily compare 2 tables in different databases.
>
> I thought about using dblink and the EXCEPT query, but then I need to
> know the field list of each query result, which is a pain in the butt.
>
> If I could return an md5 of the entire table, then I could check if the
> tables have the same hash and be confident enough that the tables were
> identical.
>
>
> Thanks
> Sim

You might also want to take a look at

http://pgfoundry.org/projects/pg-comparator/

which can give a more nuanced view of db differences and tries to be smart
about performance. It looks a bit stale; I haven't used it in ages, but it
used to be a trusty part of our test suite.
--
Vincent de Phily

Re: md5 of table

From
Sim Zacks
Date:
>>>> I thought about using dblink and the EXCEPT query, but then I need to
>>>> know the field list of each query result, which is a pain in the butt.
>>> That is not correct.  As long as the table definitions are precisely
>>> the same, you can move records across dblink without specifying
>>> fields.  You do this by using record type for the composite which
>>> dblink sends across as text.
>>>
>>> merlin
>> Do you have a quick example? This is what I have tried:
>>
>> select * from tbla
>> except
>> select * from dblink('host=dbhost dbname=otherdb user=myuser
>> password=mypwd'::text, 'select * from tbla')
>>
>> The error I get back is:
>> ERROR:  a column definition list is required for functions returning
>> "record"
> sure:
> select tbla from tbla
> except
> select t::tbla from dblink('host=dbhost dbname=otherdb user=myuser
> password=mypwd'::text, 'select tbla::text from tbla') R(t text);
We tried something like that. Unfortunately, in 8.2 you can't cast a row
type as text.

> there's a bunch of ways to do that -- you can also do the md5 on the
> remote side so you can just send the digests.
> select * from tbla
> except
> select (t::tbla).* from dblink('host=dbhost dbname=otherdb user=myuser
> password=mypwd'::text, 'select tbla::text from tbla') R(t text);
>
> should also work.
>
> This *might* work -- I didn't try.  It's been a while since I've used
> stock dblink.
> select * from tbla
> except
> select (t).* from dblink('host=dbhost dbname=otherdb user=myuser
> password=mypwd'::text, 'select tbla from tbla') R(t tbla);
>
> merlin

This looks like it might work for us. At least I would only need the
table name for the field list instead of the entire column list.