Re: tsvector limitations - Mailing list pgsql-admin

From Tim
Subject Re: tsvector limitations
Msg-id BANLkTiniXCCAdwD0qDXb3mqLSSQrzqKSgQ@mail.gmail.com
In response to Re: tsvector limitations  ("Mark Johnson" <mark@remingtondatabasesolutions.com>)
List pgsql-admin
Mark,

That link is a mirror of this mailing list; it's not from 5 months ago.
If you are in the year 2012 please respond with lottery numbers and the like.



On Mon, Jun 13, 2011 at 9:43 PM, Mark Johnson <mark@remingtondatabasesolutions.com> wrote:
 

I found another post where you asked the same questions 5 months ago.  Have you tested in that time?  http://www.spinics.net/lists/pgsql-admin/msg19438.html


A text search vector is an array of distinct lexemes (less any stopwords) and their positions.  Taking your example we get ...

select to_tsvector('the lord of the rings.txt') "answer";
      answer
-------------------
'lord':2 'rings.txt':5

You can put the length() function around it to just get the number of lexemes.  This is the size in terms of number of distinct lexemes, not size in terms of space utilization.

select length(to_tsvector('the lord of the rings.txt')) "answer";
  answer
--------
        2

You might find the tsvector data consumes 2x the space required by the input text.  It will depend on your configuration and your input data.  Test it and let us know what you find.
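
For anyone running that test, here is a minimal sketch of one way to compare sizes in bytes rather than in lexemes. It relies on the built-in pg_column_size(), which reports the number of bytes used to store a value, and on octet_length() for the input text. The table "docs" and its text column "body" are hypothetical names used only for illustration. Values read from a table may be compressed on disk, so treat the numbers as estimates.

```sql
-- Bytes of the tsvector value versus bytes of the input text.
select pg_column_size(to_tsvector('the lord of the rings.txt')) as tsvector_bytes,
       octet_length('the lord of the rings.txt'::text)          as input_bytes;

-- Hypothetical table "docs" with a text column "body":
-- lexeme count, tsvector bytes, and input bytes for each row.
select length(to_tsvector(body))         as lexemes,
       pg_column_size(to_tsvector(body)) as tsvector_bytes,
       octet_length(body)                as input_bytes
from docs;
```

Running the second query over representative files should show how close real documents come to the 1MB limit.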

-Mark

-----Original Message-----
From: Tim [mailto:elatllat@gmail.com]
Sent: Monday, June 13, 2011 03:19 PM
To: pgsql-admin@postgresql.org
Subject: [ADMIN] tsvector limitations

Dear list,

How big of a file would one need to fill the 1MB limit of a tsvector?
Reading http://www.postgresql.org/docs/9.0/static/textsearch-limitations.html seems to hint that filling a tsvector is improbable.

Is there an easy way to query the bytes of a tsvector?
Something like length(tsvector), but bytes(tsvector).

If there is no easy method to query the bytes of a tsvector,
I realize the answer is highly dependent on the contents of the file, so here are 2 random examples:
How many bytes of a tsvector would a 32MB ASCII English unique word list make?
How many bytes of a tsvector would something like "The Lord of the Rings.txt" make?

If this limitation is ever hit, is there a common practice for using more than one tsvector?
Using a separate "one to many" table seems like an obvious piece of the solution,
but I would not know how to detect or calculate how much text to give each tsvector.
Assuming tsvectors can't be linked, maybe they would need some overlap.


Thanks in advance.
