Thread: indexing longish string

indexing longish string

From
Rob Sargent
Date:
Were we to create a table which included a text field for a small block
of xml (100-1000 chars worth), would an index on that field be useful
against exact match queries?

We're wondering if a criterion such as "where 'a string expected to be
of size range 100 to 500' = tabelWithStrings.stringSearched" would make
good use of an index on "stringSearched" column.

The key is that there will only be exact match queries.

Thanks for your thoughts.




Re: indexing longish string

From
jose
Date:
Why don't you use some type of hash like md5 for indexing ?

2010/11/30 Rob Sargent <robjsargent@gmail.com>:
> Were we to create a table which included a text field for a small block
> of xml (100-1000 chars worth), would an index on that field be useful
> against exact match queries?
>
> We're wondering if a criterion such as "where 'a string expected to be
> of size range 100 to 500' = tabelWithStrings.stringSearched" would make
> good use of an index on "stringSearched" column.
>
> The key is that there will only be exact match queries.
>
> Thanks for your thoughts.
>
>
>
> --
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
>


Re: indexing longish string

From
Rob Sargent
Date:
If the performance against an index doesn't cut it, we would be forced
to choose just such an implementation, but if pg can do it straight up
that would be less work for us.  A good thing, to be sure.

On 11/30/2010 10:50 AM, jose wrote:
> Why don't you use some type of hash like md5 for indexing ?
> 
> 2010/11/30 Rob Sargent <robjsargent@gmail.com>:
>> Were we to create a table which included a text field for a small block
>> of xml (100-1000 chars worth), would an index on that field be useful
>> against exact match queries?
>>
>> We're wondering if a criterion such as "where 'a string expected to be
>> of size range 100 to 500' = tabelWithStrings.stringSearched" would make
>> good use of an index on "stringSearched" column.
>>
>> The key is that there will only be exact match queries.
>>
>> Thanks for your thoughts.
>>
>>
>>
>> --
>> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-sql
>>
> 


Re: indexing longish string

From
Kenneth Marshall
Date:
You can use a hash index for this. It's drawback is that it is not
yet WAL enabled and if your DB crashes you will need to rebuild the
index to fix the corruption. It works well(only) with equality
searches. If it is a scenario where you must have WAL, use a
function index based on the hash of the string.

Cheers,
Ken

On Tue, Nov 30, 2010 at 10:57:31AM -0700, Rob Sargent wrote:
> If the performance against an index doesn't cut it, we would be forced
> to choose just such an implementation, but if pg can do it straight up
> that would be less work for us.  A good thing, to be sure.
> 
> On 11/30/2010 10:50 AM, jose wrote:
> > Why don't you use some type of hash like md5 for indexing ?
> > 
> > 2010/11/30 Rob Sargent <robjsargent@gmail.com>:
> >> Were we to create a table which included a text field for a small block
> >> of xml (100-1000 chars worth), would an index on that field be useful
> >> against exact match queries?
> >>
> >> We're wondering if a criterion such as "where 'a string expected to be
> >> of size range 100 to 500' = tabelWithStrings.stringSearched" would make
> >> good use of an index on "stringSearched" column.
> >>
> >> The key is that there will only be exact match queries.
> >>
> >> Thanks for your thoughts.
> >>
> >>
> >>
> >> --
> >> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> >> To make changes to your subscription:
> >> http://www.postgresql.org/mailpref/pgsql-sql
> >>
> > 
> 
> -- 
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
> 


Re: indexing longish string

From
Isaac Dover
Date:
Hi,
 
While hashing is certainly a good idea, you really should consider some issues well before you get to that point. Trust me, this could save you some headaches. First, though you're probably already aware, two XML documents can be the same document, but with very different literal representations. By the XML spec, these two DOMs would be equal (readers would treat them somewhat differently):
 
<aaa b="123" c="55" />
and
<aaa      c="55"       b="123"   ></aaa>
 
If you're looking to compare the strings literally, then you have to make sure that you canonicalize both the persisted string as well as the comparison in your queries. I first encountered this problem when storing XML as a vendor's data type of XML rather than varchar. Upon inserting, the database engine parsed the string into DOM for the XML data type so that XPath and XQuery statements could be processed. This is all fine and dandy until I retrieved the literal representation. The document <xxx></xxx> was transformed, as in the above example, to the literal <xxx />, but my select was comparing using the former string. Adding to the problem was that pretty much every tool that renders XML tends to hide these sort of differences in an effort to be "friendly" (save Vim, of course :). It took a bit of time for me to notice these small differences. Also keep in mind that document order, namespaces, and space normalization are other important considerations.
 
If you can store as DOM, you might experiment a bit to see if the process of canonicalizing is more/less efficient than just XPath'ing. Though, indexing a hash would probably be most efficient but at the risk of all that jibberish I just typed.
 
More info here:
 
Thanks, and good luck!
Isaac
 
On Tue, Nov 30, 2010 at 1:36 PM, Kenneth Marshall <ktm@rice.edu> wrote:
You can use a hash index for this. It's drawback is that it is not
yet WAL enabled and if your DB crashes you will need to rebuild the
index to fix the corruption. It works well(only) with equality
searches. If it is a scenario where you must have WAL, use a
function index based on the hash of the string.

Cheers,
Ken

Re: indexing longish string

From
Isaac Dover
Date:
No problem, sir, hopefully I could help. I failed to mention that I've discovered some bugs in the PostgreSQL 8.4 XML implementation that forced me to take pause and ultimately forego XML with Postgres. I haven't looked at 9 yet, but considering the current lack of interest and/or disdain so many have for XML, I'm not anticipating great improvements. Of course, this is not to knock the db - Pg is awesome. But I'm aware that we all have limited time and there are other more important features that far more people desire. So, you are probably better off using string literals with all of the known risks. I've used some lame hashing techniques that simply stripped out everything except [:alnum:] and then simply alphabetized the characters and sometimes sent them through SHA/MD5 to get a unique representation. It seemed to work fine for my purposes.
 
Or take the old, proven approach of shredding documents into normalized tables and serializing them back out when you need them. ... ick !! ...
 
Thanks,
Isaac

On Tue, Nov 30, 2010 at 2:33 PM, Rob Sargent <robjsargent@gmail.com> wrote:
Hello Issac,

Thanks a ton.

In this bizarre little corner, I'm pretty sure I would use a text field
rather than any xml-friendly datatype.  The xml content itself will be
trivial yet dynamic :).  We need just a little more rigour than a hacky
delimited string implementation, but not much.  And we guarantee the
repeatability of the string for a given parameter set.

We could also do a manual tag-value implementation, and likely would if
we thought we were ever going to ask questions like "everything with tag
foo value less than ten", but that's not on the table. (Today...)

On 11/30/2010 12:21 PM, Isaac Dover wrote:
> Hi,
>
> While hashing is certainly a good idea, you really should consider some
> issues well before you get to that point. Trust me, this could save you
> some headaches. First, though you're probably already aware, two XML
> documents can be the same document, but with very different literal
> representations. By the XML spec, these two DOMs would be equal (readers
> would treat them somewhat differently):
>
> <aaa b="123" c="55" />
> and
> <aaa      c="55"       b="123"   ></aaa>
>
> If you're looking to compare the strings literally, then you have to
> make sure that you canonicalize both the persisted string as well as the
> comparison in your queries. I first encountered this problem when
> storing XML as a vendor's data type of XML rather than varchar. Upon
> inserting, the database engine parsed the string into DOM for the XML
> data type so that XPath and XQuery statements could be processed. This
> is all fine and dandy until I retrieved the literal representation. The
> document <xxx></xxx> was transformed, as in the above example, to the
> literal <xxx />, but my select was comparing using the former
> string. Adding to the problem was that pretty much every tool that
> renders XML tends to hide these sort of differences in an effort to be
> "friendly" (save Vim, of course :). It took a bit of time for me to
> notice these small differences. Also keep in mind that document order,
> namespaces, and space normalization are other important considerations.
>
> If you can store as DOM, you might experiment a bit to see if the
> process of canonicalizing is more/less efficient than just XPath'ing.
> Though, indexing a hash would probably be most efficient but at the risk
> of all that jibberish I just typed.
>
> More info here:
> http://www.w3.org/TR/xml-c14n11/
> http://www.w3.org/TR/2010/WD-xml-c14n2-20100831/
>
> Thanks, and good luck!
> Isaac
>
> On Tue, Nov 30, 2010 at 1:36 PM, Kenneth Marshall <ktm@rice.edu
> <mailto:ktm@rice.edu>> wrote:
>
>     You can use a hash index for this. It's drawback is that it is not
>     yet WAL enabled and if your DB crashes you will need to rebuild the
>     index to fix the corruption. It works well(only) with equality
>     searches. If it is a scenario where you must have WAL, use a
>     function index based on the hash of the string.
>
>     Cheers,
>     Ken
>