Thread: checksum

checksum

From
Federico Balbi
Date:
Hi,
  I was wondering if PGSQL has a function similar to binary_checksum() of
MS SQL Server 2000. It is pretty handy when it comes to compare rows of
data instead of having to write long boolean expressions.
binary_checksum() takes a list of fields and it returns an integer value
which sumarize the row content.

Thanks,
Fed


Re: checksum

From
David Helgason
Date:
>   I was wondering if PGSQL has a function similar to binary_checksum()
> of
> MS SQL Server 2000. It is pretty handy when it comes to compare rows of
> data instead of having to write long boolean expressions.
> binary_checksum() takes a list of fields and it returns an integer
> value
> which sumarize the row content.

On a similar note, I've found myself wanting an extended '=' operator
meaning
    (a = b or (a is null and b is null))

same goal of course, for more general comparisons...

d.
--
David Helgason,
Business Development et al.,
Over the Edge I/S (http://otee.dk)
Direct line +45 2620 0663
Main line +45 3264 5049
On 26. sep 2004, at 19:58, Federico Balbi wrote:


Re: checksum

From
"Joshua D. Drake"
Date:
David Helgason wrote:

>>   I was wondering if PGSQL has a function similar to
>> binary_checksum() of
>> MS SQL Server 2000. It is pretty handy when it comes to compare rows of
>> data instead of having to write long boolean expressions.
>> binary_checksum() takes a list of fields and it returns an integer value
>> which sumarize the row content.
>
>
You could use the md5 function.... such as :

select md5(foo) from bar where baz = 2;

J




> On a similar note, I've found myself wanting an extended '=' operator
> meaning
>     (a = b or (a is null and b is null))
>
> same goal of course, for more general comparisons...
>
> d.



--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL


Re: checksum

From
Federico Balbi
Date:
> David Helgason wrote:
>
> >>   I was wondering if PGSQL has a function similar to
> >> binary_checksum() of
> >> MS SQL Server 2000. It is pretty handy when it comes to compare rows of
> >> data instead of having to write long boolean expressions.
> >> binary_checksum() takes a list of fields and it returns an integer value
> >> which sumarize the row content.
> >
> >
> You could use the md5 function.... such as :
>
> select md5(foo) from bar where baz = 2;

Looks like md5() takes only a string. I need to pass alist of fields
instead. I was looking at the documentattion and I think I can write
soemthing like:

field1, field2, ..., fieldn = expr1, expr2, ..., exprn

This way one operator will check all the fields for equality.

Fed


Re: checksum

From
"Joshua D. Drake"
Date:

You could use the md5 function.... such as :

select md5(foo) from bar where baz = 2;   
Looks like md5() takes only a string. I need to pass alist of fields
instead. I was looking at the documentattion and I think I can write
soemthing like:

field1, field2, ..., fieldn = expr1, expr2, ..., exprn

This way one operator will check all the fields for equality.
 

Maybe I am missing what you are saying, but you can md5() the data column... So you could do:

select one,two from foo where md5(one) = 'e4da3b7fbbce2345d7772b0674a318d5';

or

select one,two from foo where md5(one) = $1; or whatever.

for example....

Sincerely,

Joshua D. Drake



Fed


---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings 


-- 
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL

Re: checksum

From
Mike Rylander
Date:
On Sun, 26 Sep 2004 20:16:52 +0200, David Helgason <david@uti.is> wrote:
> >   I was wondering if PGSQL has a function similar to binary_checksum()
> > of
> > MS SQL Server 2000. It is pretty handy when it comes to compare rows of
> > data instead of having to write long boolean expressions.
> > binary_checksum() takes a list of fields and it returns an integer
> > value
> > which sumarize the row content.
>

As noted, you can use the md5(text) function with the || (concat) operator

> On a similar note, I've found myself wanting an extended '=' operator
> meaning
>         (a = b or (a is null and b is null))
>

Setting 'transform_null_equals' to true in postgresql.conf should do
what you want.

--miker

> same goal of course, for more general comparisons...
>
> d.
> --
> David Helgason,
> Business Development et al.,
> Over the Edge I/S (http://otee.dk)
> Direct line +45 2620 0663
> Main line +45 3264 5049
> On 26. sep 2004, at 19:58, Federico Balbi wrote:
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
>       subscribe-nomail command to majordomo@postgresql.org so that your
>       message can get through to the mailing list cleanly
>

Null comparisons (was Re: checksum)

From
Stephan Szabo
Date:
On Mon, 27 Sep 2004, Mike Rylander wrote:

> On Sun, 26 Sep 2004 20:16:52 +0200, David Helgason <david@uti.is> wrote:

> > On a similar note, I've found myself wanting an extended '=' operator
> > meaning
> >         (a = b or (a is null and b is null))
> >
>
> Setting 'transform_null_equals' to true in postgresql.conf should do
> what you want.

Unfortunately, it probably won't.  That only changes the explicit token
sequence = NULL into an IS NULL, it won't help if you're doing a=b where a
or b may be NULL.

The original does appear to be equivalent to "not(a is distinct from b)",
although I'm not sure that's necessarily easier to use than the above.

Re: Null comparisons (was Re: checksum)

From
Greg Stark
Date:
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:

> > On Sun, 26 Sep 2004 20:16:52 +0200, David Helgason <david@uti.is> wrote:
>
> > > On a similar note, I've found myself wanting an extended '=' operator
> > > meaning
> > >         (a = b or (a is null and b is null))
>
> The original does appear to be equivalent to "not(a is distinct from b)",
> although I'm not sure that's necessarily easier to use than the above.

I often do things like "coalesce(a,0) = coalesce(b,0)".
(Or whatever value you know won't appear)

Though for pretty small values of "often". It always makes me think twice
about my data model when I find myself doing this. But there are definitely
still cases where it's useful and as clean as anything else I could think of.

--
greg

Re: Null comparisons (was Re: checksum)

From
"Dean Gibson (DB Administrator)"
Date:
Even simpler:  COALESCE( a = b, a IS NULL AND b IS NULL )

-- Dean

Greg Stark wrote on 2004-09-27 08:17:

>Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
>
> > > On Sun, 26 Sep 2004 20:16:52 +0200, David Helgason <david@uti.is> wrote:
> >
> > > > On a similar note, I've found myself wanting an extended '=' operator
> > > > meaning
> > > >         (a = b or (a is null and b is null))
> >
> > The original does appear to be equivalent to "not(a is distinct from b)",
> > although I'm not sure that's necessarily easier to use than the above.
>
>I often do things like "coalesce(a,0) = coalesce(b,0)".
>(Or whatever value you know won't appear)
>
>Though for pretty small values of "often". It always makes me think twice
>about my data model when I find myself doing this. But there are definitely
>still cases where it's useful and as clean as anything else I could think of.
>
>--
>greg
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 9: the planner will ignore your desire to choose an index scan if your
>       joining column's datatypes do not match


Re: Null comparisons (was Re: checksum)

From
David Helgason
Date:
On 27. sep 2004, at 22:08, Dean Gibson (DB Administrator) wrote:
> Greg Stark wrote on 2004-09-27 08:17:
>> Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
>> >> On Sun, 26 Sep 2004 20:16:52 +0200, David Helgason <david@uti.is>
>> wrote:
>> >>> On a similar note, I've found myself wanting an extended '='
>> operator
>> >>> meaning
>> >>>         (a = b or (a is null and b is null))
>> >
>> > The original does appear to be equivalent to "not(a is distinct
>> from b)",
>> > although I'm not sure that's necessarily easier to use than the
>> above.
>>
>> I often do things like "coalesce(a,0) = coalesce(b,0)".
>> (Or whatever value you know won't appear)
>>
> Even simpler:  COALESCE( a = b, a IS NULL AND b IS NULL )

I'm not quite sure what is being accomplished here... My original
expression wasn't that bad, just clunky. I'd prefer a === b or (a
samevalue b), but the above just complicates matters. Also, a 'set'
command outside the expression goes completely against the idea, that
certain fields have 'null' as a legal, comparable value, while others
do not.

Anyway, idle speculation :)

d.


Re: Null comparisons (was Re: checksum)

From
Marco Colombo
Date:
On Mon, 27 Sep 2004, David Helgason wrote:

> On 27. sep 2004, at 22:08, Dean Gibson (DB Administrator) wrote:
>> Greg Stark wrote on 2004-09-27 08:17:
>>> Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
>>> >> On Sun, 26 Sep 2004 20:16:52 +0200, David Helgason <david@uti.is>
>>> wrote:
>>> >>> On a similar note, I've found myself wanting an extended '=' operator
>>> >>> meaning
>>> >>>         (a = b or (a is null and b is null))
>>> >
>>> > The original does appear to be equivalent to "not(a is distinct from
>>> b)",
>>> > although I'm not sure that's necessarily easier to use than the above.
>>>
>>> I often do things like "coalesce(a,0) = coalesce(b,0)".
>>> (Or whatever value you know won't appear)
>>>
>> Even simpler:  COALESCE( a = b, a IS NULL AND b IS NULL )
>
> I'm not quite sure what is being accomplished here... My original expression
> wasn't that bad, just clunky. I'd prefer a === b or (a samevalue b), but the
> above just complicates matters. Also, a 'set' command outside the expression
> goes completely against the idea, that certain fields have 'null' as a legal,
> comparable value, while others do not.
>
> Anyway, idle speculation :)
>
> d.

(a = b or (a is null and b is null))

that raises a flag for me. It seems that NULL is used as a special value,
which is not. NULL just means 'unknown', nothing more, nothing less.
That's why any boolean expression involving a NULL is NULL.

'unknown' when compared to anything else just gives 'unknown'.
It means: "I can't tell whether the two expressions are the same, because
I don't know the value of one of them."
Note that that's different from "I know they're different".
They _could_ be equal, we just don't know.

'unknown' compared to 'unknown' gives of course 'unknown', since it's
just a special case of the above. Not knowing the value of both the
expressions doesn't help much. The answer can never be 'I know they
are the same.' Not knowing one is enough to say you don't know the
result of the comparison.

I can hardly imagine why you may want to select all rows that you
are certain have equal values, plus others that have potentially
different values because they are both unknown. It smells like a
design problem.


BTW,

   coalesce(a,0) = coalesce(b,0)

is wrong, since it assumes 0 is a special value, never used in the
table. If so, it's better use it from the start instead of NULL for
those special rows. That espression is true for the following rows:
  a | b
---+---
  1 | 1
  2 | 2
    |
  0 |
    | 0

the last two rows are wrongly selected.


   coalesce(a = b, a is null and b is null)

is correct, and maybe slightly better than the original

   (a = b) or (a is null and b is null)

if the implementation is smart enough to evaluate its arguments only
when needed. The or operator needs to evaluate the right side when
the left side is either false or null, COALESCE only when it's null.
I think the docs mention that.

.TM.
--
       ____/  ____/   /
      /      /       /            Marco Colombo
     ___/  ___  /   /              Technical Manager
    /          /   /             ESI s.r.l.
  _____/ _____/  _/               Colombo@ESI.it

Re: Null comparisons (was Re: checksum)

From
Greg Stark
Date:
Marco Colombo <pgsql@esiway.net> writes:

> (a = b or (a is null and b is null))
>
> that raises a flag for me. It seems that NULL is used as a special value,
> which is not.

Well, as I said, it raised a flag for me too. However, it's not good to be too
dogmatic about things. General rules are useful guiding principles but you
have to recognize when it's worth it to break them. We don't know enough about
his problem to say another approach would be any better.

For example, in one application I have a table that *does* have "unknown"
values. However I do need to look up records that match criteria including
having "unknown" values in specific positions. For most queries using NULL is
convenient and it's perfectly appropriate. But I have queries like this user
does and I use coalesce since I find the resulting expression much clearer
than using the three-way logical expression above.

Incidentally, coalesce(a,0)=coalesce(b,0) has the advantage over all the other
suggestions that you can build an index on coalesce(a,0) and/or coalesce(b,0)
and use them for the join or for individual record lookups.

> BTW,
>
>    coalesce(a,0) = coalesce(b,0)
>
> is wrong, since it assumes 0 is a special value, never used in the
> table. If so, it's better use it from the start instead of NULL for
> those special rows.

I specifically said you had to use a special value in my suggestion. Saying
something is "wrong" when it does what's needed just because it violates some
abstract design principle is just short-sighted.

Using 0 in the table might violate unique constraints or foreign key
constraints. I try to avoid having a single quirky table propagate its
quirkiness to the rest of the system.

For example, creating a bogus "0" record in some other table just to satisfy
the foreign key constraint then having the rest of the application have to
work around this bogus record results in a much less workable system than
simply using NULL instead of 0 for the special value.

>    coalesce(a = b, a is null and b is null)
>
> is correct, and maybe slightly better than the original
>
>    (a = b) or (a is null and b is null)
>
> if the implementation is smart enough to evaluate its arguments only
> when needed. The or operator needs to evaluate the right side when
> the left side is either false or null, COALESCE only when it's null.
> I think the docs mention that.

Actually it's only the latter expression that will be able to avoid evaluating
the extra expression, not the coalesce example. In any case the time to
evaluate the "a is null and b is null" part is negligible. And the fact that
neither can use any indexes is the only relevant performance question. It's
possible that's not a concern, but if it is they both lose.

That's one possible argument in favour of a === operator. It would be easy (I
think?) to make === use a btree index without even having to build a
functional index like with coalesce(a,0).

--
greg

Re: Null comparisons (was Re: checksum)

From
Marco Colombo
Date:
On Tue, 28 Sep 2004, Greg Stark wrote:

>
> Marco Colombo <pgsql@esiway.net> writes:
>
>> (a = b or (a is null and b is null))
>>
>> that raises a flag for me. It seems that NULL is used as a special value,
>> which is not.
>
> Well, as I said, it raised a flag for me too. However, it's not good to be too
> dogmatic about things. General rules are useful guiding principles but you
> have to recognize when it's worth it to break them. We don't know enough about
> his problem to say another approach would be any better.

The way SQL treats NULLs in boolean expressions is not a "general rule".
It's just the way it is. I'm being pragmatic not dogmatic.

> For example, in one application I have a table that *does* have "unknown"
> values. However I do need to look up records that match criteria including
> having "unknown" values in specific positions. For most queries using NULL is
> convenient and it's perfectly appropriate. But I have queries like this user
> does and I use coalesce since I find the resulting expression much clearer
> than using the three-way logical expression above.

Using special values is a mistake, but I agree that's being dogmatic.

Using NULL as a special value is a way worse mistake. Three way logic
it's there, there's nothing you can do about that. NULL can't be a value.
That's why you can't use it in comparisons.

As for the "dogma" part, I bet you're using special values only to
model states, not values, and are using the same columns you're using
to model values. Just add another column, life will be better.
This has nothing to do with NULLs, BTW.

> Incidentally, coalesce(a,0)=coalesce(b,0) has the advantage over all the other
> suggestions that you can build an index on coalesce(a,0) and/or coalesce(b,0)
> and use them for the join or for individual record lookups.
>
>> BTW,
>>
>>    coalesce(a,0) = coalesce(b,0)
>>
>> is wrong, since it assumes 0 is a special value, never used in the
>> table. If so, it's better use it from the start instead of NULL for
>> those special rows.
>
> I specifically said you had to use a special value in my suggestion. Saying
> something is "wrong" when it does what's needed just because it violates some
> abstract design principle is just short-sighted.
>
> Using 0 in the table might violate unique constraints or foreign key
> constraints. I try to avoid having a single quirky table propagate its
> quirkiness to the rest of the system.

_Unique_ constraints? I don't get this. 0 has no special meaning.
1, -1, 1000000 might violate unique constraints as well. Any value might.
What's the point here?  Same goes for foreign key constraints.
Actually same goes for _any_ constraint.

I agree that using a value you _know_ it's invalid due to some contraints
allows you to do the "coalesce trick" safely. But this assumes there is
at least _one_ invalid value. This is not true in general. In general,

     coalesce(a,0) = coalesce(b,0)

is not the same of

     (a = b) or (a is null and b is null).

I've even provided a fine example. It can't be a general equivalence.

> For example, creating a bogus "0" record in some other table just to satisfy
> the foreign key constraint then having the rest of the application have to
> work around this bogus record results in a much less workable system than
> simply using NULL instead of 0 for the special value.

I'm lost here. I've never proposed to add 0 anywhere. I just wonder:
"what if someone does?". In order to use 0 in coalesce(a,0) = coalesce(b,0)
you have to make sure 0 is invalid for both a and b. If you knew -1 is
invalid instead, you would use coalesce(a,-1) = coalesce(b,-1).
But what if there's no invalid value?

It might seem a "dogmatic" question, but my point is: why bother?
Just get the model right. Mapping NULLs to 0 or -1 or whatever is
meaningless, if the model is right.

That's again the whole point. What you're proposing sounds like this:
  "I'd use 0 (or other special value) in the table, but that's not good
   cause it may break some constraint.  So I use NULLs in place of my
   special value, and convert them at later time with coalesce(), so that
   I can compare them again."

Now, that's abusing of NULLs. There's a reason why NULLs don't break
foreign key constraints, and a reason why you can't compare them.
Your use of coalesce(), your functional index, it's just placing a
brown paperbag on the real problem, which is a wrong model. Don't play
with NULLs, fix your schema. A badly designed model is not a matter
of "general principles". It's a plain real-world mistake. And leads
to any kind of acrobatic exercises in SQL to get the system work.
One day you'll run into another query you've got a hard time to write
or to make run efficently.

NULLs are not meant to be "values" and should never be used as such.
Anytime you feel the need of an index on them, or to compare them,
you're treating them as values, and that's plain wrong. Not because
of any dogma, but because that's how the system works. The _only_
way to use them as values is to abuse of them.

>
>>    coalesce(a = b, a is null and b is null)
>>
>> is correct, and maybe slightly better than the original
>>
>>    (a = b) or (a is null and b is null)
>>
>> if the implementation is smart enough to evaluate its arguments only
>> when needed. The or operator needs to evaluate the right side when
>> the left side is either false or null, COALESCE only when it's null.
>> I think the docs mention that.
>
> Actually it's only the latter expression that will be able to avoid evaluating
> the extra expression, not the coalesce example. In any case the time to
             ^^^^^^^^^^^^^^^^^^^^^^^^
Why not?

> evaluate the "a is null and b is null" part is negligible. And the fact that
> neither can use any indexes is the only relevant performance question. It's
> possible that's not a concern, but if it is they both lose.

We agree here. I wrote "maybe slightly better".

> That's one possible argument in favour of a === operator. It would be easy (I
> think?) to make === use a btree index without even having to build a
> functional index like with coalesce(a,0).

A new, non standard operator just to support badly modeled schemas?
No, thanks.

In C, you can store a (small) string into a pointer type variable,
if you're careful enough, and know what you're doing. It might work
for you, but it's still abusing of the language: you can't
expect the language to _support_ that. And for sure, you can't sell
it as a general solution.

I'm not against abusing of the db, nor playing dirty tricks, if that fits
your needs. You're free to design your db the way you like and face
the cost of a careful design or of later SQL gymnastics. I'm fine,
as long as you don't ask for syntactic sugar to support those "features".

.TM.
--
       ____/  ____/   /
      /      /       /            Marco Colombo
     ___/  ___  /   /              Technical Manager
    /          /   /             ESI s.r.l.
  _____/ _____/  _/               Colombo@ESI.it


Re: Null comparisons (was Re: checksum)

From
Greg Stark
Date:
Marco Colombo <pgsql@esiway.net> writes:

> I'm not against abusing of the db, nor playing dirty tricks, if that fits
> your needs. You're free to design your db the way you like and face
> the cost of a careful design or of later SQL gymnastics. I'm fine,
> as long as you don't ask for syntactic sugar to support those "features".

My point is that you're making judgements about his schema without actually
knowing what you're talking about. For all we know his schema is entirely
reasonable and it's the query that has unusual requirements. Spouting general
design principles that may or may not apply as being iron-clad rules and
saying it's just wrong to break them is wilful blindness.

Database modelling is not something you can do by holding up some textbook and
screaming "third normal form" until the developers trying to get work done
cower in submission. There are lots of times when breaking or bending the
rules is entirely reasonable and blindly following them is simply a waste of
time.

For example, I have a table that uses NULLs to represent absent data. In 90%
of the queries three value logic is just exactly what's needed. In any case
they each have foreign key dependencies and having special values to represent
the absent values would be a major pain. It would require satisfying the
foreign keys with bogus records.

However I have queries that have to match provided data with other records,
including having missing data in the same position. For such a query I need to
break the usual model of three value logic and write something similar to what
this user needs. In my case no index would really be reasonable since there
are half a dozen such fields, but in general there's no reason an index
shouldn't be available for such cases.

--
greg

Re: checksum

From
"Henriksen, Jonas F"
Date:
Hi,

not sure I quite understand, but could you not just concatenate all the fields and test on that? like:

select test1, test2, test3 from testtable
where test1 || test2 || test3 = 'whateveryouwanttotest'

you might have to do some typecasting/conversion on data types other than strings...

regards Jonas:))

-----Original Message-----
From: Federico Balbi [mailto:fbalbi@cs.utsa.edu]
Sent: 26. september 2004 20:32
To: postgresql-general
Subject: Re: checksum


> David Helgason wrote:
>
> >>   I was wondering if PGSQL has a function similar to
> >> binary_checksum() of
> >> MS SQL Server 2000. It is pretty handy when it comes to compare rows of
> >> data instead of having to write long boolean expressions.
> >> binary_checksum() takes a list of fields and it returns an integer value
> >> which sumarize the row content.
> >
> >
> You could use the md5 function.... such as :
>
> select md5(foo) from bar where baz = 2;

Looks like md5() takes only a string. I need to pass alist of fields
instead. I was looking at the documentattion and I think I can write
soemthing like:

field1, field2, ..., fieldn = expr1, expr2, ..., exprn

This way one operator will check all the fields for equality.

Fed