Re: indexing longish string - Mailing list pgsql-sql

From Isaac Dover
Subject Re: indexing longish string
Date
Msg-id AANLkTi=u6WiqozCu3ghLjSQQ_S8vCxejEnbAo8xhh5rF@mail.gmail.com
Whole thread Raw
In response to Re: indexing longish string  (Kenneth Marshall <ktm@rice.edu>)
List pgsql-sql
Hi,
 
While hashing is certainly a good idea, you really should consider some issues well before you get to that point. Trust me, this could save you some headaches. First, though you're probably already aware, two XML documents can be the same document, but with very different literal representations. By the XML spec, these two DOMs would be equal (readers would treat them somewhat differently):
 
<aaa b="123" c="55" />
and
<aaa      c="55"       b="123"   ></aaa>
 
If you're looking to compare the strings literally, then you have to make sure that you canonicalize both the persisted string as well as the comparison in your queries. I first encountered this problem when storing XML as a vendor's data type of XML rather than varchar. Upon inserting, the database engine parsed the string into DOM for the XML data type so that XPath and XQuery statements could be processed. This is all fine and dandy until I retrieved the literal representation. The document <xxx></xxx> was transformed, as in the above example, to the literal <xxx />, but my select was comparing using the former string. Adding to the problem was that pretty much every tool that renders XML tends to hide these sort of differences in an effort to be "friendly" (save Vim, of course :). It took a bit of time for me to notice these small differences. Also keep in mind that document order, namespaces, and space normalization are other important considerations.
 
If you can store as DOM, you might experiment a bit to see if the process of canonicalizing is more/less efficient than just XPath'ing. Though, indexing a hash would probably be most efficient but at the risk of all that jibberish I just typed.
 
More info here:
 
Thanks, and good luck!
Isaac
 
On Tue, Nov 30, 2010 at 1:36 PM, Kenneth Marshall <ktm@rice.edu> wrote:
You can use a hash index for this. It's drawback is that it is not
yet WAL enabled and if your DB crashes you will need to rebuild the
index to fix the corruption. It works well(only) with equality
searches. If it is a scenario where you must have WAL, use a
function index based on the hash of the string.

Cheers,
Ken

pgsql-sql by date:

Previous
From: Kenneth Marshall
Date:
Subject: Re: indexing longish string
Next
From: Peter Eisentraut
Date:
Subject: Re: How strings are sorted by LC_COLLATE specifically?