Thread: Ways to speed up ts_rank
Hello,

First let me say thanks for a fantastic database system. The hard work this community has put into Postgres really shows.

A client of mine has been using Postgres quite effectively for a while now, but has recently hit a performance issue with text index queries, specifically when using the ts_rank or ts_rank_cd functions. The database has a text index of around 200,000 documents. Investigation revealed that text queries are slow only when using ts_rank or ts_rank_cd. Without a ts_rank function, any query is answered within 200 ms or so; with a ts_rank function, queries take up to 30 seconds. Deeper investigation using gprof showed that the problem is probably not ts_rank or ts_rank_cd themselves, but the fact that those functions retrieve thousands of TOASTed tsvectors.

I estimate that the total size of the text index is 900 MB uncompressed (including the TOAST data). It is a table that stores only a docid, a tsvector, and a couple of other small fields. Some queries cause ts_rank to retrieve 700 MB of that 900 MB. That means each time a user types a simple query, Postgres retrieves and decompresses almost 700 MB of TOAST data. That's the size of a CD. No wonder it sometimes takes so long!

I tried using SET STORAGE to optimize access to the TOAST tuples, but that only made things worse: even with EXTERNAL storage, Postgres still has to copy the TOAST data into RAM, and that takes time. I also recompiled Postgres with a larger TOAST chunk size. That did improve performance by about 20%, but I'm looking for a much bigger win than that.

The tsvectors are large because the documents are of variable quality. We cannot rely on authors to provide good metadata; we really need the text index to just index everything in the documents. I've seen how ts_rank can do its job on millions of documents in milliseconds, but that only works when the text index stores a small amount of data per document. Short metadata just won't work for my client.
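For reference, this is roughly how I measured the TOAST overhead and what the SET STORAGE experiment looked like (pgtextindex and text_vector are the table and column involved; the size query is just a sketch):

```sql
-- Rough measure of how much of the table lives outside the main heap
-- (TOAST plus indexes), versus the heap itself.
SELECT pg_size_pretty(pg_relation_size('pgtextindex')) AS heap_size,
       pg_size_pretty(pg_total_relation_size('pgtextindex')
                      - pg_relation_size('pgtextindex')) AS toast_and_index_size;

-- The experiment that made things worse: store the tsvector
-- uncompressed and out-of-line, so only I/O remains (no decompression).
ALTER TABLE pgtextindex ALTER COLUMN text_vector SET STORAGE EXTERNAL;
```

Note that SET STORAGE only affects newly stored tuples, so existing rows have to be rewritten (e.g. with a dummy UPDATE) before the change takes effect.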
Also, my client cannot simply stop using ts_rank.

I've thought of some possible solutions to speed up ts_rank:

1. Cache the whole tsvectors in RAM. This would be different from the buffer cache, which apparently only knows how to cache the chunks of a TOASTed tsvector. I have thought of three ways to do this:

   A. Write a C extension that maintains a cache of tsvectors in shared memory. I think I would use each document's hash as the cache key.

   B. Add to Postgres the ability to cache any decompressed TOAST value. This would probably involve expanding the SET STORAGE clause of ALTER TABLE so that administrators can configure which TOAST values are worth caching.

   C. Alter the buffer cache to be more aware of TOAST; make it cache whole TOAST values rather than chunks. This would be invasive and might be a bad idea overall.

2. Maintain two tsvectors per row: one for querying, another for ranking. The tsvector used for ranking would be trimmed to prevent TOASTing. This would speed up queries, but it may hurt the quality of ranking.

3. Write a C extension that stores the tsvectors in memory-mapped files on disk, allowing us to at least take advantage of kernel-level caching.

What do you think? Does one of these solutions stand out to you as a good idea, or have I overlooked some simpler solution?

Shane
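P.S. To make option 2 concrete, here is a sketch, assuming the important lexemes carry A or B weights (rank_vector is a hypothetical new column; ts_filter keeps only the lexemes with the given weights, positions included, which is what ts_rank needs):

```sql
-- Hypothetical second column holding a trimmed copy of the tsvector,
-- small enough to stay out of TOAST.
ALTER TABLE pgtextindex ADD COLUMN rank_vector tsvector;
UPDATE pgtextindex SET rank_vector = ts_filter(text_vector, '{a,b}');

-- Match against the full vector, but rank with the trimmed one.
SELECT docid,
       coefficient * ts_rank_cd('{0.1, 0.2, 0.5, 1.0}', rank_vector,
                                to_tsquery('english', 'stuff')) AS rank
FROM pgtextindex
WHERE text_vector @@ to_tsquery('english', 'stuff')
ORDER BY rank DESC LIMIT 3;
```

A trigger would be needed to keep rank_vector in sync, and documents whose matches are all in D-weighted body text would rank as zero, which is exactly the quality concern I mentioned.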
Le 2012-10-09 à 17:38, Shane Hathaway a écrit :

> Hello,
>
> The database has a text index of around 200,000 documents. Investigation revealed that text queries are slow only when using ts_rank or ts_rank_cd. Without a ts_rank function, any query is answered within 200 ms or so; with a ts_rank function, queries take up to 30 seconds. Deeper investigation using gprof showed that the problem is probably not ts_rank or ts_rank_cd, but the fact that those functions retrieve thousands of TOASTed tsvectors.

Is the query perhaps doing something like this:

SELECT ...
FROM table
WHERE tsvectorcol @@ plainto_tsquery('...')
ORDER BY ts_rank(...)

If so, ts_rank() is run for every document. What you should do instead is:

SELECT *
FROM (
  SELECT ...
  FROM table
  WHERE tsvectorcol @@ plainto_tsquery('...')) AS t1
ORDER BY ts_rank(...)

Notice the ts_rank() is on the outer query, which means it'll only run on the subset of documents which match the query. This is explicitly mentioned in the docs:

"""Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches."""

(last paragraph of) http://www.postgresql.org/docs/current/static/textsearch-controls.html#TEXTSEARCH-RANKING

Hope that helps!
François Beausoleil
We'll present some improvements to FTS in Prague. Unfortunately, we have only several minutes during the lightning talk. In short, we improved GIN to store additional information (coordinates for FTS, for example) and to return search results ordered by rank, which gave us performance better than Sphinx. It's just a prototype, but we already got a median of 8 msec for 6 million classifieds.

We haven't tested long documents yet.

Regards,
Oleg

On Wed, 10 Oct 2012, François Beausoleil wrote:

> Le 2012-10-09 à 17:38, Shane Hathaway a écrit :
>
>> Hello,
>>
>> The database has a text index of around 200,000 documents. Investigation revealed that text queries are slow only when using ts_rank or ts_rank_cd. Without a ts_rank function, any query is answered within 200 ms or so; with a ts_rank function, queries take up to 30 seconds. Deeper investigation using gprof showed that the problem is probably not ts_rank or ts_rank_cd, but the fact that those functions retrieve thousands of TOASTed tsvectors.
>
> Is the query perhaps doing something like this:
>
> SELECT ...
> FROM table
> WHERE tsvectorcol @@ plainto_tsquery('...')
> ORDER BY ts_rank(...)
>
> If so, ts_rank() is run for every document. What you should do instead is:
>
> SELECT *
> FROM (
>   SELECT ...
>   FROM table
>   WHERE tsvectorcol @@ plainto_tsquery('...')) AS t1
> ORDER BY ts_rank(...)
>
> Notice the ts_rank() is on the outer query, which means it'll only run on the subset of documents which match the query. This is explicitly mentioned in the docs:
>
> """Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches."""
>
> (last paragraph of) http://www.postgresql.org/docs/current/static/textsearch-controls.html#TEXTSEARCH-RANKING
>
> Hope that helps!
> François Beausoleil

_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On 10/10/2012 06:38 AM, François Beausoleil wrote:
> Notice the ts_rank() is on the outer query, which means it'll only
> run on the subset of documents which match the query. This is
> explicitly mentioned in the docs:
>
> """Ranking can be expensive since it requires consulting the tsvector
> of each matching document, which can be I/O bound and therefore slow.
> Unfortunately, it is almost impossible to avoid since practical
> queries often result in large numbers of matches."""
>
> (last paragraph of)
> http://www.postgresql.org/docs/current/static/textsearch-controls.html#TEXTSEARCH-RANKING

Indeed, I have studied that paragraph in depth, trying to gather as much meaning from it as I can. :-) However, the following two queries take exactly the same time, suggesting to me that ts_rank_cd is really only looking at matching rows, not all rows:

SELECT docid,
       coefficient * ts_rank_cd('{0.1, 0.2, 0.5, 1.0}', text_vector,
                                to_tsquery('english', 'stuff')) AS rank
FROM pgtextindex
WHERE (text_vector @@ to_tsquery('english', 'stuff'))
ORDER BY rank DESC LIMIT 3;

SELECT docid,
       coefficient * ts_rank_cd('{0.1, 0.2, 0.5, 1.0}', text_vector,
                                to_tsquery('english', 'stuff')) AS rank
FROM (SELECT * FROM pgtextindex
      WHERE (text_vector @@ to_tsquery('english', 'stuff'))) AS filtered
ORDER BY rank DESC LIMIT 3;

Thanks for the suggestion though.

By the way, all the tsvectors are already loaded into the kernel cache when I execute the queries, so ranking large documents is in fact CPU bound rather than I/O bound. The CPU is pegged the whole time.

Shane
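P.S. For anyone wanting to reproduce the CPU-bound observation, a buffer-level plan (available in 9.0+) makes it visible; when the Buffers lines report few or no disk reads yet the query still takes many seconds, the time is going to detoasting and ranking rather than I/O:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT docid,
       coefficient * ts_rank_cd('{0.1, 0.2, 0.5, 1.0}', text_vector,
                                to_tsquery('english', 'stuff')) AS rank
FROM pgtextindex
WHERE text_vector @@ to_tsquery('english', 'stuff')
ORDER BY rank DESC LIMIT 3;
```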
On 10/10/2012 08:59 AM, Oleg Bartunov wrote:
> We'll present some improvements to FTS in Prague. Unfortunately, we
> have only several minutes during the lightning talk. In short, we
> improved GIN to store additional information (coordinates for FTS,
> for example) and to return search results ordered by rank, which gave
> us performance better than Sphinx. It's just a prototype, but we
> already got a median of 8 msec for 6 million classifieds.
>
> We haven't tested long documents yet.

That sounds like the solution I'm looking for! Storing the info for ts_rank in an index would probably do the trick. Can I get this code somewhere? I've isolated the table I need to optimize and I can easily run tests on non-production code.

Shane