ltree and full text search questions - Mailing list pgsql-general

From Marc Mamin
Subject ltree and full text search questions
Date
Msg-id C4DAC901169B624F933534A26ED7DF31010A5093@JENMAIL01.ad.intershop.net
Whole thread Raw
List pgsql-general
Hello,

(sorry for this long mail)

I have started to evaluate ltree and tsearch (first on Windows with PG
Version 8.3.7) and I would apprecicate some clarification. The first
part deals with ltree where I have some questions, and the second part
is a concept to combine ltree with the full text search capabilities.
Here I'd like to get your opinion whether I'm on the right way or not.
Moreover I will probably need my own parser for the full text search and
I would be very gratefull if anybody could provide me with some C code
example that is similar to my requirement.

  and kudos for ltree and the full text search.  I'm impressed by their
performances :-)


I would like to use these two functionalities to store and analyze
paths, (the model beyond the data is not a tree, but a dense directed
graphs).


A) ltree
========

A ltree path would look like this: ...2_456.7_3425.1_23.9_231....
whereas each node is a combination of 2 information:
<class_id>_<item_id>

If my idea works well, I may also like to put some more information in
each node which would hence be a micropath of fix length (3 or 4)

With such a model, I could offer global path analysis on the <class>,
and drilldown possibilities on the <items>.

I came to this idea as the documentation of ltree says that a node is a
list of words separated by the '_' character.

a short example:

--drop  table paths ;
create table paths (id int, path ltree);
insert into paths(id,path)values(1, '1_11.2_13.3_10.4_13');
insert into paths(id,path)values(2, '1_12.4_15.3_11.4_10.15_14.1_11');
insert into paths(id,path)values(3, '1_11.2_13.3_10.4_10');
insert into paths(id,path)values(4, '1_12.4_15.3_11.3_10.13_14.13_14');
insert into paths(id,path)values(5, '1_11.2_13.3_10.2_13');
insert into paths(id,path)values(6, '1_12.4_15.3_11.1_10.12_14.1_11');
insert into paths(id,path)values(7, '127_1235');

--now I can e.g retrieve all items that have a path from <class 2> to
<class 4> using the prefix matching:

select id from paths where path ~ '*.2_*.*.4_*.*'::lquery -->(1 & 3)
--drop  table paths ;

And now my questions and comments:

1)

From the module description, I first though that the '_' character had a
special meaning
but it just seems to be an extra allowed character beside  [a-zA-Z0-9].
Am I correct or is '_' defined internally as separator for
indexing/searching ltree data ?


2)
The documentation says that the length of a label path is limited to
65kB. This is the max number of nodes, and not the size of the string.
Moreover, when you try to define a larger path, it will be silently cut
without notice or error (seems that the first 65kB are just dropped, but
I'm not sure about this).

Maybe it would be a good thing to add this to the documentation.


3)

Gist index

create index path_gist_idx on paths using gist (path);
=> ERROR: index row requires 621840 bytes, maximum size is 8191

So it seems that gist indexes can only be used for paths that do not
exceed 8kB in size,
which is much less than the 65kB limit in depth.

Is this correct or am I missing a point ?
(I also have PostGIS 1.3.5 installed. Could this be an issue ?)


B) Full text search
===================

ltree offers a prefix search on the nodes, but no suffix search which I
need to look for given <item_id>.
So my idea is to combine ltree with full text search.

If a node has a fix format  like  <class>_<item>, I could use full text
search indexing to look for paths containing a given item.

My first step was to check the default parser which works really badly
in this case:

SELECT alias, description, token FROM ts_debug('1_12.4_15.3_11.1_10');

alias         description         token
-----        -----------        -----
uint        Unsigned integer    1
blank        Space symbols        _
float        Decimal notation    12.4
blank        Space symbols        _
float        Decimal notation    15.3
blank        Space symbols        _
float        Decimal notation    11.1
blank        Space symbols        _
uint        Unsigned integer    10


So I guess I need my own parser which should return something like:
(this would be a specific parser that had to know the exact pattern and
meaning of a node)

alias     description     token
-----        -----------        -----
class        class_id        1
blank        Space symbols    _
item        item_id        12
blank        Space symbols    .
class        class_id        4
blank        Space symbols    _
item        item_id        15
blank        Space symbols    .
class        class_id        3

As already said, I would be very gratefull if anybody could provide me
with some C code example that is similar to my requirement.

Having this, I would then define different search configuration:

CREATE TEXT SEARCH CONFIGURATION ltclass;
CREATE TEXT SEARCH CONFIGURATION ltitem;

ALTER TEXT SEARCH CONFIGURATION ltclass DROP MAPPING FOR ltitem;
ALTER TEXT SEARCH CONFIGURATION ltitem DROP MAPPING FOR ltclass;

and build different search indexes for each kind of data contained in a
node:

CREATE INDEX path_class_ix ON paths USING gin(to_tsvector('ltclass',
ltree2text(path)));
CREATE INDEX path_item_ix  ON paths USING gin(to_tsvector('ltitem',
ltree2text(path)));


Finally using both ltree and full text search would provides a very
powerfull framework for path analysis where a node is not only a single
reference,
but a combination of some pieces of information :-)


And a last question: the alias below exists only in the parser.
I can't use it later on to restrict a search to token of a given alias.
Right ?



Thank you for your attention.

Marc Mamin



pgsql-general by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Weird encoding behavior
Next
From: "Albe Laurenz"
Date:
Subject: Re: Weird encoding behavior