Thread: phraseto_tsquery design
I recently migrated over from SQL Server to Postgres (through Amazon Aurora) for the cost.
I have been working with the full-text search functions for quite some time, but have encountered a few major roadblocks in my quest to use postgres full-text search functionality to its fullest. I am using it as a blob text-search tool, where the entire document (could be up to 100 pages) is indexed and searched on:
1. Limited table and document sizes. In SQL server, I can create a 1.5 TB table no problem (filingid, data blobs). However, in postgres, this is not possible - I am limited to approximately 100-200GB per table and 1MB per tsvector. This is an annoyance, but can get around it.
2. However, it seems impossible to bypass the performance problem of phrase searching. I conduct quite a bit of phrase searching, and although postgres' "phraseto_tsquery" performs great on phrases with uncommon words, it slows to a screeching halt on phrases with common words such as "law firm" or, for example, "bank of america". This is a huge problem, because "plainto_tsquery" performs just fine on these but as I understand it, phrase searching is built to do a scan after finding each word using "plainto"?
There are already positions and the "plainto" function is quite fast; is there a way to modify the "phraseto" query to perform a useful and fast search that looks for the distance between found words appropriately?
Seems fairly trivial to modify the function this way, but I didn't see it on the development roadmap. It's a pretty critical piece and I have seen that other people have had issues with it too, and a test between this function and some other similar functions show that the performance of the phraseto function is quite poor.
Thanks,
Sagiv
Hello, On Thu, Jun 21, 2018 at 11:02:32AM -0400, Sagiv Some wrote: > 2. However, it seems impossible to bypass the performance problem of phrase > searching. I conduct quite a bit of phrase searching, and although > postgres' "phraseto_tsquery" performs great on phrases with uncommon words, > it slows to a screeching halt on phrases with common words such as "law > firm" or, for example, "bank of america". This is a huge problem, because > "plainto_tsquery" performs just fine on these but as I understand it, > phrase searching is built to do a scan after finding each word using > "plainto"? > > There are already positions and the "plainto" function is quite fast; is > there a way to modify the "phraseto" query to perform a useful and fast > search that looks for the distance between found words appropriately? If I understood you correctly you use GIN index for text search. Unfortunately it isn't phraseto_tsquery() function issue. It is GIN index characteristic. tsvector consists of lexemes and their positions retreived from text document. GIN has only lexems, and it is OK for regular search (using plainto_tsquery() function). But phrase search (via phraseto_tsquery()) requires lexem positions. During phrase search additional work is made: - first get all items from index which satisfy the query (as in reqular search) - then read entire tsvector from the heap - recheck all got items and exclude those of them which don't satisfy the phrase query Last two point is additional work. We have our index as an extension. It is changed GIN index and can store lexemes and their positions. And therefore phrase queries are faster. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company