Thread: Unique values across a table of arrays - documents and tags

Unique values across a table of arrays - documents and tags

From
Ivan Voras
Date:
Hello,

I know I need to re-engineer this so it doesn't suck by design, so I'm
wondering if there is some nifty PostgreSQL feature or best practice
which may automagically do the best thing.

I store information about documents which are tagged by string tags. The
structure is very simple:

CREATE TABLE documents (
        id SERIAL NOT NULL,
        title TEXT NOT NULL,
        -- other fields --
        tags TEXT[] NOT NULL,
        flags INTEGER
);

Currently, I have a GIN index on the tags field, and it works for searching:

edem=> explain analyze select id,title,flags from documents where tags
@> ARRAY['tag'];
                                                      QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on documents  (cost=8.00..12.01 rows=1 width=39)
(actual time=0.067..0.086 rows=9 loops=1)
   Recheck Cond: (tags @> '{tag}'::text[])
   ->  Bitmap Index Scan on documents_tags  (cost=0.00..8.00 rows=1
width=0) (actual time=0.053..0.053 rows=9 loops=1)
         Index Cond: (tags @> '{tag}'::text[])
 Total runtime: 0.135 ms
(5 rows)


The other feature I need is a list of unique tags in all the documents,
e.g.:

edem=> explain analyze select distinct unnest(tags) as tag from documents;
                                                 QUERY PLAN

-------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=28.54..28.84 rows=24 width=42) (actual
time=0.261..0.307 rows=44 loops=1)
   ->  Seq Scan on documents  (cost=0.00..28.45 rows=36 width=42)
(actual time=0.020..0.157 rows=68 loops=1)
 Total runtime: 0.419 ms
(3 rows)

This is unfortunately slow (because I know the load will increase and
this will be a common operation).

The thing I was planning to do is create a separate table, with only the
unique tags, and possibly an array of documents which have these tags,
which will be maintained with UPDATE and INSERT triggers on the
documents table, but then I remembered that the GIN index itself does
something not unlike this method. Is there a way to make use of this
information to get a list of unique tags?

Barring that, what would you suggest for efficiently handing a classic
structure like this (meaning documents with tags)?


Attachment

Re: Unique values across a table of arrays - documents and tags

From
François Beausoleil
Date:
Le 2012-11-07 à 10:21, Ivan Voras a écrit :

>
> This is unfortunately slow (because I know the load will increase and
> this will be a common operation).
>
> The thing I was planning to do is create a separate table, with only the
> unique tags, and possibly an array of documents which have these tags,
> which will be maintained with UPDATE and INSERT triggers on the
> documents table, but then I remembered that the GIN index itself does
> something not unlike this method. Is there a way to make use of this
> information to get a list of unique tags?
>
> Barring that, what would you suggest for efficiently handing a classic
> structure like this (meaning documents with tags)?
>

Can you structure it as the "classic" many to many pattern:

documents <-> taggings <-> tags

Unique tags then becomes a plain seq scan on a smallish table (tags). To keep the ability to have a single field, you
canhide the documents table behind a view that would do an array_agg, such as: 

SELECT documents.*, array_agg(taggings.tag)
FROM documents JOIN tags ON tags.document_id = documents.id
GROUP BY documents.*

Not sure we can do GROUP BY documents.*, but if not, you list your columns individually.

Hope that helps!
François

Re: Unique values across a table of arrays - documents and tags

From
Ivan Voras
Date:
On 07/11/2012 16:34, François Beausoleil wrote:
> Le 2012-11-07 à 10:21, Ivan Voras a écrit :

>> Barring that, what would you suggest for efficiently handing a classic
>> structure like this (meaning documents with tags)?
>
> Can you structure it as the "classic" many to many pattern:
>
> documents <-> taggings <-> tags

Yes, that is as you said, a classic solution to a classic problem :)

If needed, this is the way I will do it, but for now I'm asking if
there's something that can be done to avoid creating another table or
two. The reason I'm asking is that I've often found that PostgreSQL can
do much more than I thought it can.




Attachment

Re: Unique values across a table of arrays - documents and tags

From
Florent Guillaume
Date:
Maybe you could store the tags as fulltext words, query them using
fulltext search, and use ts_stat to gather the list of words? Needs to
be benched of course.
You'll probably need to change the config to avoid stemming and stop words.

Florent

On Wed, Nov 7, 2012 at 4:38 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 07/11/2012 16:34, François Beausoleil wrote:
>> Le 2012-11-07 à 10:21, Ivan Voras a écrit :
>
>>> Barring that, what would you suggest for efficiently handing a classic
>>> structure like this (meaning documents with tags)?
>>
>> Can you structure it as the "classic" many to many pattern:
>>
>> documents <-> taggings <-> tags
>
> Yes, that is as you said, a classic solution to a classic problem :)
>
> If needed, this is the way I will do it, but for now I'm asking if
> there's something that can be done to avoid creating another table or
> two. The reason I'm asking is that I've often found that PostgreSQL can
> do much more than I thought it can.
>
>
>



--
Florent Guillaume, Director of R&D, Nuxeo
Open Source, Java EE based, Enterprise Content Management (ECM)
http://www.nuxeo.com   http://www.nuxeo.org   +33 1 40 33 79 87


Re: Unique values across a table of arrays - documents and tags

From
Ivan Voras
Date:
On 07/11/2012 16:54, Florent Guillaume wrote:
> Maybe you could store the tags as fulltext words, query them using
> fulltext search, and use ts_stat to gather the list of words? Needs to
> be benched of course.
> You'll probably need to change the config to avoid stemming and stop words.

I have thought of that but decided not to even try because fts will also
use GIN and the combination of fts on a text field + GIN cannot possibly
be faster than just GIN on an array...


Attachment

Re: Unique values across a table of arrays - documents and tags

From
Florent Guillaume
Date:
On Wed, Nov 7, 2012 at 5:00 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 07/11/2012 16:54, Florent Guillaume wrote:
>> Maybe you could store the tags as fulltext words, query them using
>> fulltext search, and use ts_stat to gather the list of words? Needs to
>> be benched of course.
>> You'll probably need to change the config to avoid stemming and stop words.
>
> I have thought of that but decided not to even try because fts will also
> use GIN and the combination of fts on a text field + GIN cannot possibly
> be faster than just GIN on an array...

Unless ts_stat itself is optimized to reach inside the index to fetch
its statistics, but I don't think that's the case.

Florent

--
Florent Guillaume, Director of R&D, Nuxeo
Open Source, Java EE based, Enterprise Content Management (ECM)
http://www.nuxeo.com   http://www.nuxeo.org   +33 1 40 33 79 87