I broached this topic last year[1], but the project got tabled until
now, so I'm raising it again. We want to be able to search text
(extracted from character-based PDF files) which will contain legal
terms and statute cites, and we want to be able to do tsearch2
searches (under 8.3.recent). It's clear enough how to create a
dictionary to gracefully handle the legal terms, but I'm less sure
about the statute cites.

I got one response[2], which mentioned a prefix search in the 8.4
release and provided a link to a Perl regular-expression-based
dictionary. I'm wondering if anyone has feedback on either of these
techniques, and whether they might work for our needs. I'm not sure I
adequately described our needs, so I'll fill that out a little more.

People are likely to search for statute cites, which tend to have a
hierarchical form. I'm not sure the prefix approach will work for
this. For example, there is a section 939.64 in the state statutes
dealing with commission of a crime while wearing a bulletproof
garment. If someone searches for that, they should find subsections
like 939.64(1) or 939.64(2) but not different sections which start
with the same characters like 939.641 (the section on concealing
identity) or 939.645 (the section on hate crimes). A search for
chapter 939 should return any of the above.

Of course, we want someone to be able to search on 939.64, 939.641,
and 939.645 and get documents which reference all of the above (i.e.,
to look for a document referring to a hate crime committed while
concealing identity and wearing a bulletproof garment).
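
To make the requirement concrete, here is a rough sketch (in Python,
purely for illustration) of the matching behavior I have in mind:
expand each cite found in a document into one token per level of the
hierarchy, then match query cites exactly against those tokens. The
cite pattern and the names expand_cite, document_tokens, and matches
are my own assumptions, not anything tsearch2 provides; a real
solution would presumably need a custom dictionary or a preprocessing
step to emit tokens like these.

```python
import re

# Assumed cite shape: chapter, optional ".section", then optional
# parenthesized subsection levels, e.g. 939, 939.64, 939.64(1), 939.64(2)(a).
CITE_RE = re.compile(r"^(\d+)(?:\.(\d+)((?:\([0-9A-Za-z]+\))*))?$")
LEVEL_RE = re.compile(r"\([0-9A-Za-z]+\)")


def expand_cite(cite):
    """Return the cite plus each of its ancestors, outermost first."""
    m = CITE_RE.match(cite)
    if m is None:
        raise ValueError("unrecognized cite: %r" % cite)
    chapter, section, subs = m.groups()
    tokens = [chapter]
    if section is not None:
        current = chapter + "." + section
        tokens.append(current)
        # Each subsection level adds one more token to the chain.
        for level in LEVEL_RE.findall(subs):
            current += level
            tokens.append(current)
    return tokens


def document_tokens(cites):
    """Collect the hierarchy tokens for every cite found in one document."""
    tokens = set()
    for cite in cites:
        tokens.update(expand_cite(cite))
    return tokens


def matches(doc_tokens, query_cites):
    """True only if the document covers every cite in the query (AND)."""
    return all(q in doc_tokens for q in query_cites)


doc = document_tokens(["939.64(1)", "939.641"])
print(matches(doc, ["939.64"]))             # subsection found via its parent
print(matches(doc, ["939"]))                # chapter search finds everything
print(matches(doc, ["939.64", "939.645"]))  # 939.645 is not in this document
```

Because the tokens are compared whole rather than by string prefix, a
document citing only 939.641 yields the tokens 939 and 939.641, so a
search for 939.64 does not hit it.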

Suggestions welcome on how to handle this user requirement.

-Kevin

[1] http://archives.postgresql.org/pgsql-admin/2008-06/msg00033.php
[2] http://archives.postgresql.org/pgsql-admin/2008-06/msg00034.php