Re: indexing longish string - Mailing list pgsql-sql

From Isaac Dover
Subject Re: indexing longish string
Date
Msg-id AANLkTikEpYW3icgh+qy6e+QnJjWOnrYfs4WW0zo_Xxjj@mail.gmail.com
Whole thread Raw
In response to indexing longish string  (Rob Sargent <robjsargent@gmail.com>)
List pgsql-sql
No problem, sir, hopefully I could help. I failed to mention that I've discovered some bugs in the PostgreSQL 8.4 XML implementation that forced me to take pause and ultimately forego XML with Postgres. I haven't looked at 9 yet, but considering the current lack of interest and/or disdain so many have for XML, I'm not anticipating great improvements. Of course, this is not to knock the db - Pg is awesome. But I'm aware that we all have limited time and there are other more important features that far more people desire. So, you are probably better off using string literals with all of the known risks. I've used some lame hashing techniques that simply stripped out everything except [:alnum:] and then simply alphabetized the characters and sometimes sent them through SHA/MD5 to get a unique representation. It seemed to work fine for my purposes.
 
Or take the old, proven approach of shredding documents into normalized tables and serializing them back out when you need them. ... ick !! ...
 
Thanks,
Isaac

On Tue, Nov 30, 2010 at 2:33 PM, Rob Sargent <robjsargent@gmail.com> wrote:
Hello Issac,

Thanks a ton.

In this bizarre little corner, I'm pretty sure I would use a text field
rather than any xml-friendly datatype.  The xml content itself will be
trivial yet dynamic :).  We need just a little more rigour than a hacky
delimited string implementation, but not much.  And we guarantee the
repeatability of the string for a given parameter set.

We could also do a manual tag-value implementation, and likely would if
we thought we were ever going to ask questions like "everything with tag
foo value less than ten", but that's not on the table. (Today...)

On 11/30/2010 12:21 PM, Isaac Dover wrote:
> Hi,
>
> While hashing is certainly a good idea, you really should consider some
> issues well before you get to that point. Trust me, this could save you
> some headaches. First, though you're probably already aware, two XML
> documents can be the same document, but with very different literal
> representations. By the XML spec, these two DOMs would be equal (readers
> would treat them somewhat differently):
>
> <aaa b="123" c="55" />
> and
> <aaa      c="55"       b="123"   ></aaa>
>
> If you're looking to compare the strings literally, then you have to
> make sure that you canonicalize both the persisted string as well as the
> comparison in your queries. I first encountered this problem when
> storing XML as a vendor's data type of XML rather than varchar. Upon
> inserting, the database engine parsed the string into DOM for the XML
> data type so that XPath and XQuery statements could be processed. This
> is all fine and dandy until I retrieved the literal representation. The
> document <xxx></xxx> was transformed, as in the above example, to the
> literal <xxx />, but my select was comparing using the former
> string. Adding to the problem was that pretty much every tool that
> renders XML tends to hide these sort of differences in an effort to be
> "friendly" (save Vim, of course :). It took a bit of time for me to
> notice these small differences. Also keep in mind that document order,
> namespaces, and space normalization are other important considerations.
>
> If you can store as DOM, you might experiment a bit to see if the
> process of canonicalizing is more/less efficient than just XPath'ing.
> Though, indexing a hash would probably be most efficient but at the risk
> of all that jibberish I just typed.
>
> More info here:
> http://www.w3.org/TR/xml-c14n11/
> http://www.w3.org/TR/2010/WD-xml-c14n2-20100831/
>
> Thanks, and good luck!
> Isaac
>
> On Tue, Nov 30, 2010 at 1:36 PM, Kenneth Marshall <ktm@rice.edu
> <mailto:ktm@rice.edu>> wrote:
>
>     You can use a hash index for this. It's drawback is that it is not
>     yet WAL enabled and if your DB crashes you will need to rebuild the
>     index to fix the corruption. It works well(only) with equality
>     searches. If it is a scenario where you must have WAL, use a
>     function index based on the hash of the string.
>
>     Cheers,
>     Ken
>

pgsql-sql by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: How strings are sorted by LC_COLLATE specifically?
Next
From: Samuel Gendler
Date:
Subject: aggregation question