Thread: checksum
Hi, I was wondering if PGSQL has a function similar to binary_checksum() of MS SQL Server 2000. It is pretty handy when it comes to compare rows of data instead of having to write long boolean expressions. binary_checksum() takes a list of fields and it returns an integer value which sumarize the row content. Thanks, Fed
> I was wondering if PGSQL has a function similar to binary_checksum() > of > MS SQL Server 2000. It is pretty handy when it comes to compare rows of > data instead of having to write long boolean expressions. > binary_checksum() takes a list of fields and it returns an integer > value > which sumarize the row content. On a similar note, I've found myself wanting an extended '=' operator meaning (a = b or (a is null and b is null)) same goal of course, for more general comparisons... d. -- David Helgason, Business Development et al., Over the Edge I/S (http://otee.dk) Direct line +45 2620 0663 Main line +45 3264 5049 On 26. sep 2004, at 19:58, Federico Balbi wrote:
David Helgason wrote: >> I was wondering if PGSQL has a function similar to >> binary_checksum() of >> MS SQL Server 2000. It is pretty handy when it comes to compare rows of >> data instead of having to write long boolean expressions. >> binary_checksum() takes a list of fields and it returns an integer value >> which sumarize the row content. > > You could use the md5 function.... such as : select md5(foo) from bar where baz = 2; J > On a similar note, I've found myself wanting an extended '=' operator > meaning > (a = b or (a is null and b is null)) > > same goal of course, for more general comparisons... > > d. -- Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC Postgresql support, programming shared hosting and dedicated hosting. +1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com PostgreSQL Replicator -- production quality replication for PostgreSQL
> David Helgason wrote: > > >> I was wondering if PGSQL has a function similar to > >> binary_checksum() of > >> MS SQL Server 2000. It is pretty handy when it comes to compare rows of > >> data instead of having to write long boolean expressions. > >> binary_checksum() takes a list of fields and it returns an integer value > >> which sumarize the row content. > > > > > You could use the md5 function.... such as : > > select md5(foo) from bar where baz = 2; Looks like md5() takes only a string. I need to pass alist of fields instead. I was looking at the documentattion and I think I can write soemthing like: field1, field2, ..., fieldn = expr1, expr2, ..., exprn This way one operator will check all the fields for equality. Fed
You could use the md5 function.... such as : select md5(foo) from bar where baz = 2;Looks like md5() takes only a string. I need to pass alist of fields instead. I was looking at the documentattion and I think I can write soemthing like: field1, field2, ..., fieldn = expr1, expr2, ..., exprn This way one operator will check all the fields for equality.
Maybe I am missing what you are saying, but you can md5() the data column... So you could do:
select one,two from foo where md5(one) = 'e4da3b7fbbce2345d7772b0674a318d5';
or
select one,two from foo where md5(one) = $1; or whatever.
for example....
Sincerely,
Joshua D. Drake
Fed ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings
-- Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC Postgresql support, programming shared hosting and dedicated hosting. +1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com PostgreSQL Replicator -- production quality replication for PostgreSQL
On Sun, 26 Sep 2004 20:16:52 +0200, David Helgason <david@uti.is> wrote: > > I was wondering if PGSQL has a function similar to binary_checksum() > > of > > MS SQL Server 2000. It is pretty handy when it comes to compare rows of > > data instead of having to write long boolean expressions. > > binary_checksum() takes a list of fields and it returns an integer > > value > > which sumarize the row content. > As noted, you can use the md5(text) function with the || (concat) operator > On a similar note, I've found myself wanting an extended '=' operator > meaning > (a = b or (a is null and b is null)) > Setting 'transform_null_equals' to true in postgresql.conf should do what you want. --miker > same goal of course, for more general comparisons... > > d. > -- > David Helgason, > Business Development et al., > Over the Edge I/S (http://otee.dk) > Direct line +45 2620 0663 > Main line +45 3264 5049 > On 26. sep 2004, at 19:58, Federico Balbi wrote: > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly >
On Mon, 27 Sep 2004, Mike Rylander wrote: > On Sun, 26 Sep 2004 20:16:52 +0200, David Helgason <david@uti.is> wrote: > > On a similar note, I've found myself wanting an extended '=' operator > > meaning > > (a = b or (a is null and b is null)) > > > > Setting 'transform_null_equals' to true in postgresql.conf should do > what you want. Unfortunately, it probably won't. That only changes the explicit token sequence = NULL into an IS NULL, it won't help if you're doing a=b where a or b may be NULL. The original does appear to be equivalent to "not(a is distinct from b)", although I'm not sure that's necessarily easier to use than the above.
Stephan Szabo <sszabo@megazone.bigpanda.com> writes: > > On Sun, 26 Sep 2004 20:16:52 +0200, David Helgason <david@uti.is> wrote: > > > > On a similar note, I've found myself wanting an extended '=' operator > > > meaning > > > (a = b or (a is null and b is null)) > > The original does appear to be equivalent to "not(a is distinct from b)", > although I'm not sure that's necessarily easier to use than the above. I often do things like "coalesce(a,0) = coalesce(b,0)". (Or whatever value you know won't appear) Though for pretty small values of "often". It always makes me think twice about my data model when I find myself doing this. But there are definitely still cases where it's useful and as clean as anything else I could think of. -- greg
Even simpler: COALESCE( a = b, a IS NULL AND b IS NULL ) -- Dean Greg Stark wrote on 2004-09-27 08:17: >Stephan Szabo <sszabo@megazone.bigpanda.com> writes: > > > > On Sun, 26 Sep 2004 20:16:52 +0200, David Helgason <david@uti.is> wrote: > > > > > > On a similar note, I've found myself wanting an extended '=' operator > > > > meaning > > > > (a = b or (a is null and b is null)) > > > > The original does appear to be equivalent to "not(a is distinct from b)", > > although I'm not sure that's necessarily easier to use than the above. > >I often do things like "coalesce(a,0) = coalesce(b,0)". >(Or whatever value you know won't appear) > >Though for pretty small values of "often". It always makes me think twice >about my data model when I find myself doing this. But there are definitely >still cases where it's useful and as clean as anything else I could think of. > >-- >greg > > >---------------------------(end of broadcast)--------------------------- >TIP 9: the planner will ignore your desire to choose an index scan if your > joining column's datatypes do not match
On 27. sep 2004, at 22:08, Dean Gibson (DB Administrator) wrote: > Greg Stark wrote on 2004-09-27 08:17: >> Stephan Szabo <sszabo@megazone.bigpanda.com> writes: >> >> On Sun, 26 Sep 2004 20:16:52 +0200, David Helgason <david@uti.is> >> wrote: >> >>> On a similar note, I've found myself wanting an extended '=' >> operator >> >>> meaning >> >>> (a = b or (a is null and b is null)) >> > >> > The original does appear to be equivalent to "not(a is distinct >> from b)", >> > although I'm not sure that's necessarily easier to use than the >> above. >> >> I often do things like "coalesce(a,0) = coalesce(b,0)". >> (Or whatever value you know won't appear) >> > Even simpler: COALESCE( a = b, a IS NULL AND b IS NULL ) I'm not quite sure what is being accomplished here... My original expression wasn't that bad, just clunky. I'd prefer a === b or (a samevalue b), but the above just complicates matters. Also, a 'set' command outside the expression goes completely against the idea, that certain fields have 'null' as a legal, comparable value, while others do not. Anyway, idle speculation :) d.
On Mon, 27 Sep 2004, David Helgason wrote: > On 27. sep 2004, at 22:08, Dean Gibson (DB Administrator) wrote: >> Greg Stark wrote on 2004-09-27 08:17: >>> Stephan Szabo <sszabo@megazone.bigpanda.com> writes: >>> >> On Sun, 26 Sep 2004 20:16:52 +0200, David Helgason <david@uti.is> >>> wrote: >>> >>> On a similar note, I've found myself wanting an extended '=' operator >>> >>> meaning >>> >>> (a = b or (a is null and b is null)) >>> > >>> > The original does appear to be equivalent to "not(a is distinct from >>> b)", >>> > although I'm not sure that's necessarily easier to use than the above. >>> >>> I often do things like "coalesce(a,0) = coalesce(b,0)". >>> (Or whatever value you know won't appear) >>> >> Even simpler: COALESCE( a = b, a IS NULL AND b IS NULL ) > > I'm not quite sure what is being accomplished here... My original expression > wasn't that bad, just clunky. I'd prefer a === b or (a samevalue b), but the > above just complicates matters. Also, a 'set' command outside the expression > goes completely against the idea, that certain fields have 'null' as a legal, > comparable value, while others do not. > > Anyway, idle speculation :) > > d. (a = b or (a is null and b is null)) that raises a flag for me. It seems that NULL is used as a special value, which is not. NULL just means 'unknown', nothing more, nothing less. That's why any boolean expression involving a NULL is NULL. 'unknown' when compared to anything else just gives 'unknown'. It means: "I can't tell whether the two expressions are the same, because I don't know the value of one of them." Note that that's different from "I know they're different". They _could_ be equal, we just don't know. 'unknown' compared to 'unknown' gives of course 'unknown', since it's just a special case of the above. Not knowing the value of both the expressions doesn't help much. The answer can never be 'I know they are the same.' Not knowing one is enough to say you don't know the result of the comparison. I can hardly imagine why you may want to select all rows that you are certain have equal values, plus others that have potentially different values because they are both unknown. It smells like a design problem. BTW, coalesce(a,0) = coalesce(b,0) is wrong, since it assumes 0 is a special value, never used in the table. If so, it's better use it from the start instead of NULL for those special rows. That espression is true for the following rows: a | b ---+--- 1 | 1 2 | 2 | 0 | | 0 the last two rows are wrongly selected. coalesce(a = b, a is null and b is null) is correct, and maybe slightly better than the original (a = b) or (a is null and b is null) if the implementation is smart enough to evaluate its arguments only when needed. The or operator needs to evaluate the right side when the left side is either false or null, COALESCE only when it's null. I think the docs mention that. .TM. -- ____/ ____/ / / / / Marco Colombo ___/ ___ / / Technical Manager / / / ESI s.r.l. _____/ _____/ _/ Colombo@ESI.it
Marco Colombo <pgsql@esiway.net> writes: > (a = b or (a is null and b is null)) > > that raises a flag for me. It seems that NULL is used as a special value, > which is not. Well, as I said, it raised a flag for me too. However, it's not good to be too dogmatic about things. General rules are useful guiding principles but you have to recognize when it's worth it to break them. We don't know enough about his problem to say another approach would be any better. For example, in one application I have a table that *does* have "unknown" values. However I do need to look up records that match criteria including having "unknown" values in specific positions. For most queries using NULL is convenient and it's perfectly appropriate. But I have queries like this user does and I use coalesce since I find the resulting expression much clearer than using the three-way logical expression above. Incidentally, coalesce(a,0)=coalesce(b,0) has the advantage over all the other suggestions that you can build an index on coalesce(a,0) and/or coalesce(b,0) and use them for the join or for individual record lookups. > BTW, > > coalesce(a,0) = coalesce(b,0) > > is wrong, since it assumes 0 is a special value, never used in the > table. If so, it's better use it from the start instead of NULL for > those special rows. I specifically said you had to use a special value in my suggestion. Saying something is "wrong" when it does what's needed just because it violates some abstract design principle is just short-sighted. Using 0 in the table might violate unique constraints or foreign key constraints. I try to avoid having a single quirky table propagate its quirkiness to the rest of the system. For example, creating a bogus "0" record in some other table just to satisfy the foreign key constraint then having the rest of the application have to work around this bogus record results in a much less workable system than simply using NULL instead of 0 for the special value. > coalesce(a = b, a is null and b is null) > > is correct, and maybe slightly better than the original > > (a = b) or (a is null and b is null) > > if the implementation is smart enough to evaluate its arguments only > when needed. The or operator needs to evaluate the right side when > the left side is either false or null, COALESCE only when it's null. > I think the docs mention that. Actually it's only the latter expression that will be able to avoid evaluating the extra expression, not the coalesce example. In any case the time to evaluate the "a is null and b is null" part is negligible. And the fact that neither can use any indexes is the only relevant performance question. It's possible that's not a concern, but if it is they both lose. That's one possible argument in favour of a === operator. It would be easy (I think?) to make === use a btree index without even having to build a functional index like with coalesce(a,0). -- greg
On Tue, 28 Sep 2004, Greg Stark wrote: > > Marco Colombo <pgsql@esiway.net> writes: > >> (a = b or (a is null and b is null)) >> >> that raises a flag for me. It seems that NULL is used as a special value, >> which is not. > > Well, as I said, it raised a flag for me too. However, it's not good to be too > dogmatic about things. General rules are useful guiding principles but you > have to recognize when it's worth it to break them. We don't know enough about > his problem to say another approach would be any better. The way SQL treats NULLs in boolean expressions is not a "general rule". It's just the way it is. I'm being pragmatic not dogmatic. > For example, in one application I have a table that *does* have "unknown" > values. However I do need to look up records that match criteria including > having "unknown" values in specific positions. For most queries using NULL is > convenient and it's perfectly appropriate. But I have queries like this user > does and I use coalesce since I find the resulting expression much clearer > than using the three-way logical expression above. Using special values is a mistake, but I agree that's being dogmatic. Using NULL as a special value is a way worse mistake. Three way logic it's there, there's nothing you can do about that. NULL can't be a value. That's why you can't use it in comparisons. As for the "dogma" part, I bet you're using special values only to model states, not values, and are using the same columns you're using to model values. Just add another column, life will be better. This has nothing to do with NULLs, BTW. > Incidentally, coalesce(a,0)=coalesce(b,0) has the advantage over all the other > suggestions that you can build an index on coalesce(a,0) and/or coalesce(b,0) > and use them for the join or for individual record lookups. > >> BTW, >> >> coalesce(a,0) = coalesce(b,0) >> >> is wrong, since it assumes 0 is a special value, never used in the >> table. If so, it's better use it from the start instead of NULL for >> those special rows. > > I specifically said you had to use a special value in my suggestion. Saying > something is "wrong" when it does what's needed just because it violates some > abstract design principle is just short-sighted. > > Using 0 in the table might violate unique constraints or foreign key > constraints. I try to avoid having a single quirky table propagate its > quirkiness to the rest of the system. _Unique_ constraints? I don't get this. 0 has no special meaning. 1, -1, 1000000 might violate unique constraints as well. Any value might. What's the point here? Same goes for foreign key constraints. Actually same goes for _any_ constraint. I agree that using a value you _know_ it's invalid due to some contraints allows you to do the "coalesce trick" safely. But this assumes there is at least _one_ invalid value. This is not true in general. In general, coalesce(a,0) = coalesce(b,0) is not the same of (a = b) or (a is null and b is null). I've even provided a fine example. It can't be a general equivalence. > For example, creating a bogus "0" record in some other table just to satisfy > the foreign key constraint then having the rest of the application have to > work around this bogus record results in a much less workable system than > simply using NULL instead of 0 for the special value. I'm lost here. I've never proposed to add 0 anywhere. I just wonder: "what if someone does?". In order to use 0 in coalesce(a,0) = coalesce(b,0) you have to make sure 0 is invalid for both a and b. If you knew -1 is invalid instead, you would use coalesce(a,-1) = coalesce(b,-1). But what if there's no invalid value? It might seem a "dogmatic" question, but my point is: why bother? Just get the model right. Mapping NULLs to 0 or -1 or whatever is meaningless, if the model is right. That's again the whole point. What you're proposing sounds like this: "I'd use 0 (or other special value) in the table, but that's not good cause it may break some constraint. So I use NULLs in place of my special value, and convert them at later time with coalesce(), so that I can compare them again." Now, that's abusing of NULLs. There's a reason why NULLs don't break foreign key constraints, and a reason why you can't compare them. Your use of coalesce(), your functional index, it's just placing a brown paperbag on the real problem, which is a wrong model. Don't play with NULLs, fix your schema. A badly designed model is not a matter of "general principles". It's a plain real-world mistake. And leads to any kind of acrobatic exercises in SQL to get the system work. One day you'll run into another query you've got a hard time to write or to make run efficently. NULLs are not meant to be "values" and should never be used as such. Anytime you feel the need of an index on them, or to compare them, you're treating them as values, and that's plain wrong. Not because of any dogma, but because that's how the system works. The _only_ way to use them as values is to abuse of them. > >> coalesce(a = b, a is null and b is null) >> >> is correct, and maybe slightly better than the original >> >> (a = b) or (a is null and b is null) >> >> if the implementation is smart enough to evaluate its arguments only >> when needed. The or operator needs to evaluate the right side when >> the left side is either false or null, COALESCE only when it's null. >> I think the docs mention that. > > Actually it's only the latter expression that will be able to avoid evaluating > the extra expression, not the coalesce example. In any case the time to ^^^^^^^^^^^^^^^^^^^^^^^^ Why not? > evaluate the "a is null and b is null" part is negligible. And the fact that > neither can use any indexes is the only relevant performance question. It's > possible that's not a concern, but if it is they both lose. We agree here. I wrote "maybe slightly better". > That's one possible argument in favour of a === operator. It would be easy (I > think?) to make === use a btree index without even having to build a > functional index like with coalesce(a,0). A new, non standard operator just to support badly modeled schemas? No, thanks. In C, you can store a (small) string into a pointer type variable, if you're careful enough, and know what you're doing. It might work for you, but it's still abusing of the language: you can't expect the language to _support_ that. And for sure, you can't sell it as a general solution. I'm not against abusing of the db, nor playing dirty tricks, if that fits your needs. You're free to design your db the way you like and face the cost of a careful design or of later SQL gymnastics. I'm fine, as long as you don't ask for syntactic sugar to support those "features". .TM. -- ____/ ____/ / / / / Marco Colombo ___/ ___ / / Technical Manager / / / ESI s.r.l. _____/ _____/ _/ Colombo@ESI.it
Marco Colombo <pgsql@esiway.net> writes: > I'm not against abusing of the db, nor playing dirty tricks, if that fits > your needs. You're free to design your db the way you like and face > the cost of a careful design or of later SQL gymnastics. I'm fine, > as long as you don't ask for syntactic sugar to support those "features". My point is that you're making judgements about his schema without actually knowing what you're talking about. For all we know his schema is entirely reasonable and it's the query that has unusual requirements. Spouting general design principles that may or may not apply as being iron-clad rules and saying it's just wrong to break them is wilful blindness. Database modelling is not something you can do by holding up some textbook and screaming "third normal form" until the developers trying to get work done cower in submission. There are lots of times when breaking or bending the rules is entirely reasonable and blindly following them is simply a waste of time. For example, I have a table that uses NULLs to represent absent data. In 90% of the queries three value logic is just exactly what's needed. In any case they each have foreign key dependencies and having special values to represent the absent values would be a major pain. It would require satisfying the foreign keys with bogus records. However I have queries that have to match provided data with other records, including having missing data in the same position. For such a query I need to break the usual model of three value logic and write something similar to what this user needs. In my case no index would really be reasonable since there are half a dozen such fields, but in general there's no reason an index shouldn't be available for such cases. -- greg
Hi, not sure I quite understand, but could you not just concatenate all the fields and test on that? like: select test1, test2, test3 from testtable where test1 || test2 || test3 = 'whateveryouwanttotest' you might have to do some typecasting/conversion on data types other than strings... regards Jonas:)) -----Original Message----- From: Federico Balbi [mailto:fbalbi@cs.utsa.edu] Sent: 26. september 2004 20:32 To: postgresql-general Subject: Re: checksum > David Helgason wrote: > > >> I was wondering if PGSQL has a function similar to > >> binary_checksum() of > >> MS SQL Server 2000. It is pretty handy when it comes to compare rows of > >> data instead of having to write long boolean expressions. > >> binary_checksum() takes a list of fields and it returns an integer value > >> which sumarize the row content. > > > > > You could use the md5 function.... such as : > > select md5(foo) from bar where baz = 2; Looks like md5() takes only a string. I need to pass alist of fields instead. I was looking at the documentattion and I think I can write soemthing like: field1, field2, ..., fieldn = expr1, expr2, ..., exprn This way one operator will check all the fields for equality. Fed