Thread: Looking for tips on improving full-text search quality in Postgres
Hi all - When I have a need for both sophisticated database querying and full-text search, I'd rather not stand up a technology stackwith multiple tools (e.g., Postgres and Apache Solr, or Postgres and ElasticSearch with a zomboDB bridge). So I've beenlooking at the Postgres full-text search capability, and comparing it to Apache Solr. My experience so far - which hasnot been entirely anecdotal, but hasn't amounted to a formal TREC-style evaluation - is that Postgres full-text search,in any ranking/normalization configuration I can create, is reliably worse than Solr. Now, I understand that the wholepoint of Solr is search, and this is a sideline for Postgres, but I'd like to figure out how close Postgres can get,and while I'm knowledgeable about search technologies, I'm not an expert. And I've looked for information on the Webabout comparing Postgres search to other search capabilities, and everything I've found so far is extremely basic. Does anybody have any pointers to resources (people, sites, journal articles, blogs, etc.) which are deeply knowledgeableabout this comparison? Thanks in advance - Sam Bayer The MITRE Corporation sam@mitre.org
On Fri, Mar 4, 2022 at 08:10:48AM -0500, Bayer, Samuel wrote: > Hi all - > > When I have a need for both sophisticated database querying and > full-text search, I'd rather not stand up a technology stack with > multiple tools (e.g., Postgres and Apache Solr, or Postgres and > ElasticSearch with a zomboDB bridge). So I've been looking at the > Postgres full-text search capability, and comparing it to Apache > Solr. My experience so far - which has not been entirely anecdotal, > but hasn't amounted to a formal TREC-style evaluation - is that > Postgres full-text search, in any ranking/normalization configuration > I can create, is reliably worse than Solr. Now, I understand that the > whole point of Solr is search, and this is a sideline for Postgres, > but I'd like to figure out how close Postgres can get, and while I'm > knowledgeable about search technologies, I'm not an expert. And I've > looked for information on the Web about comparing Postgres search > to other search capabilities, and everything I've found so far is > extremely basic. > > Does anybody have any pointers to resources (people, sites, journal > articles, blogs, etc.) which are deeply knowledgeable about this > comparison? Uh, most of our full text seach is done by Russian developers, who are obviously very good at it. It would be helpful if you could list exactly what is missing and then we can have a discussion the hackers list to see what is possible. I think it would be helpful if we just document what we _don't_ have. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: [EXT] Re: Looking for tips on improving full-text search quality in Postgres
From
"Bayer, Samuel"
Date:
Thanks for replying. My problem is that I can't provide enough guidance on what isn't working, because (a) I don't have goodenough intuitions about how the normalization options are expected to affect the results, and (b) I can't identify aspecific missing function - I'm just observing that I can't make the results as high-quality as Solr. My apologies. Sam On 3/4/22 10:25 AM, Bruce Momjian wrote: > On Fri, Mar 4, 2022 at 08:10:48AM -0500, Bayer, Samuel wrote: >> Hi all - >> >> When I have a need for both sophisticated database querying and >> full-text search, I'd rather not stand up a technology stack with >> multiple tools (e.g., Postgres and Apache Solr, or Postgres and >> ElasticSearch with a zomboDB bridge). So I've been looking at the >> Postgres full-text search capability, and comparing it to Apache >> Solr. My experience so far - which has not been entirely anecdotal, >> but hasn't amounted to a formal TREC-style evaluation - is that >> Postgres full-text search, in any ranking/normalization configuration >> I can create, is reliably worse than Solr. Now, I understand that the >> whole point of Solr is search, and this is a sideline for Postgres, >> but I'd like to figure out how close Postgres can get, and while I'm >> knowledgeable about search technologies, I'm not an expert. And I've >> looked for information on the Web about comparing Postgres search >> to other search capabilities, and everything I've found so far is >> extremely basic. >> >> Does anybody have any pointers to resources (people, sites, journal >> articles, blogs, etc.) which are deeply knowledgeable about this >> comparison? > > Uh, most of our full text seach is done by Russian developers, who are > obviously very good at it. It would be helpful if you could list > exactly what is missing and then we can have a discussion the hackers > list to see what is possible. I think it would be helpful if we just > document what we _don't_ have. >
Re: [EXT] Re: Looking for tips on improving full-text search quality in Postgres
From
Atri Sharma
Date:
Can you define what "high quality" is? Are you referring to precision? Or recall? Or speed? Or query dialect? On Fri, Mar 4, 2022 at 8:59 PM Bayer, Samuel <sam@mitre.org> wrote: > > Thanks for replying. My problem is that I can't provide enough guidance on what isn't working, because (a) I don't havegood enough intuitions about how the normalization options are expected to affect the results, and (b) I can't identifya specific missing function - I'm just observing that I can't make the results as high-quality as Solr. > > My apologies. > > Sam > > On 3/4/22 10:25 AM, Bruce Momjian wrote: > > On Fri, Mar 4, 2022 at 08:10:48AM -0500, Bayer, Samuel wrote: > >> Hi all - > >> > >> When I have a need for both sophisticated database querying and > >> full-text search, I'd rather not stand up a technology stack with > >> multiple tools (e.g., Postgres and Apache Solr, or Postgres and > >> ElasticSearch with a zomboDB bridge). So I've been looking at the > >> Postgres full-text search capability, and comparing it to Apache > >> Solr. My experience so far - which has not been entirely anecdotal, > >> but hasn't amounted to a formal TREC-style evaluation - is that > >> Postgres full-text search, in any ranking/normalization configuration > >> I can create, is reliably worse than Solr. Now, I understand that the > >> whole point of Solr is search, and this is a sideline for Postgres, > >> but I'd like to figure out how close Postgres can get, and while I'm > >> knowledgeable about search technologies, I'm not an expert. And I've > >> looked for information on the Web about comparing Postgres search > >> to other search capabilities, and everything I've found so far is > >> extremely basic. > >> > >> Does anybody have any pointers to resources (people, sites, journal > >> articles, blogs, etc.) which are deeply knowledgeable about this > >> comparison? > > > > Uh, most of our full text seach is done by Russian developers, who are > > obviously very good at it. It would be helpful if you could list > > exactly what is missing and then we can have a discussion the hackers > > list to see what is possible. I think it would be helpful if we just > > document what we _don't_ have. > > > > -- Regards, Atri l'apprenant
Re: [EXT] Re: Looking for tips on improving full-text search quality in Postgres
From
"Bayer, Samuel"
Date:
Fair question. Not worried so much about speed. Looking, essentially, at precision by rank (i.e., average precision and variants).I have not explored the contrasts between the default English language configuration in Postgres and the one inSolr - I have no reason to believe that there's anything odd going on there. My problem is that I can't provide specificperformance numbers, or the corpus in question, but my overall impression is that the top N (10, 20) results fromPostgres, no matter how I configure the ranking, aren't as relevant to the query, as a group, than the ones from Solr. Example anecdote: the documents I'm searching come with metadata (e.g., title), which I'm not indexing specially (not a separatefield, just part of the raw text of the document). When I search even for single terms, and look at the titles ofthe results, the titles in the Solr results more frequently contain that term than the titles in the Postgres results.I also FEEL like I've noticed that the problem is more apparent in "OR" queries; if I search for a disjunction ofterms, the documents that contain all the terms are more likely to be high in the Solr rankings than in the Postgres rankings. I apologize for not being able to be more specific. Thanks in advance, again. On 3/4/22 10:30 AM, Atri Sharma wrote: > Can you define what "high quality" is? > > Are you referring to precision? Or recall? Or speed? Or query dialect? > > On Fri, Mar 4, 2022 at 8:59 PM Bayer, Samuel <sam@mitre.org> wrote: >> >> Thanks for replying. My problem is that I can't provide enough guidance on what isn't working, because (a) I don't havegood enough intuitions about how the normalization options are expected to affect the results, and (b) I can't identifya specific missing function - I'm just observing that I can't make the results as high-quality as Solr. >> >> My apologies. >> >> Sam >> >> On 3/4/22 10:25 AM, Bruce Momjian wrote: >>> On Fri, Mar 4, 2022 at 08:10:48AM -0500, Bayer, Samuel wrote: >>>> Hi all - >>>> >>>> When I have a need for both sophisticated database querying and >>>> full-text search, I'd rather not stand up a technology stack with >>>> multiple tools (e.g., Postgres and Apache Solr, or Postgres and >>>> ElasticSearch with a zomboDB bridge). So I've been looking at the >>>> Postgres full-text search capability, and comparing it to Apache >>>> Solr. My experience so far - which has not been entirely anecdotal, >>>> but hasn't amounted to a formal TREC-style evaluation - is that >>>> Postgres full-text search, in any ranking/normalization configuration >>>> I can create, is reliably worse than Solr. Now, I understand that the >>>> whole point of Solr is search, and this is a sideline for Postgres, >>>> but I'd like to figure out how close Postgres can get, and while I'm >>>> knowledgeable about search technologies, I'm not an expert. And I've >>>> looked for information on the Web about comparing Postgres search >>>> to other search capabilities, and everything I've found so far is >>>> extremely basic. >>>> >>>> Does anybody have any pointers to resources (people, sites, journal >>>> articles, blogs, etc.) which are deeply knowledgeable about this >>>> comparison? >>> >>> Uh, most of our full text seach is done by Russian developers, who are >>> obviously very good at it. It would be helpful if you could list >>> exactly what is missing and then we can have a discussion the hackers >>> list to see what is possible. I think it would be helpful if we just >>> document what we _don't_ have. >>> >> >> > >
Re: [EXT] Re: Looking for tips on improving full-text search quality in Postgres
From
Bruce Momjian
Date:
On Fri, Mar 4, 2022 at 10:41:16AM -0500, Bayer, Samuel wrote: > Example anecdote: the documents I'm searching come with metadata > (e.g., title), which I'm not indexing specially (not a separate field, > just part of the raw text of the document). When I search even for > single terms, and look at the titles of the results, the titles in the > Solr results more frequently contain that term than the titles in the > Postgres results. I also FEEL like I've noticed that the problem is > more apparent in "OR" queries; if I search for a disjunction of terms, > the documents that contain all the terms are more likely to be high in > the Solr rankings than in the Postgres rankings. > > I apologize for not being able to be more specific. I know it is hard to quantify. Is it possible that Postgres is treating all the terms equally, while Solr is prioritizing terms that are earlier in the document? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: [EXT] Re: Looking for tips on improving full-text search quality in Postgres
From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes: > On Fri, Mar 4, 2022 at 10:41:16AM -0500, Bayer, Samuel wrote: >> I apologize for not being able to be more specific. > I know it is hard to quantify. Is it possible that Postgres is treating > all the terms equally, while Solr is prioritizing terms that are earlier > in the document? A few basic questions: * which ranking function are you using? * with what options? * which PG version exactly? As far as I can see from a quick look at the docs, neither ts_rank() nor ts_rank_cd() consider "earlier in the document" to be an interesting consideration. They do have the ability to prefer terms that have been marked as having a higher weight, but you'd need to do some setup work to make that useful --- basically, you'd have to separate out the title or other metadata and apply setweight() to it while building the tsvectors. I wouldn't be surprised if Solr has some well-tuned default heuristics that mean that you don't have to work hard to get good results from it. The current state of our FTS features is more like "here's all the parts, but you have to build the behavior you want". ISTM that our FTS features have basically been on autopilot since they went in. I'd sort of hoped that we'd see more parsers, more ranking functions, etc, over time ... but nothing like that has happened. I'm not sure if that's just lack of interest, or if people find the code too difficult to work with. regards, tom lane
Re: [EXT] Re: Looking for tips on improving full-text search quality in Postgres
From
"Bayer, Samuel"
Date:
I've tried both ranking functions. I've tried a variety of the normalization settings. I'm using the standard English languageconfiguration. Postgres 13. I do understand your FTS philosophy - I suppose I'm looking for guidance about how best to approximate the search capabilityin Solr using the FTS pieces you have. One concrete question, I suppose, is: the classic TF/IDF search strategyrelies on inverse document frequency, which looks across the corpus. I can't tell whether that corpus-wide frequencyinformation is taken into account in either ranking function. I don't know if Solr weights earlier tokens more heavily, but I wouldn't be surprised if it does. On 3/4/22 11:09 AM, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: >> On Fri, Mar 4, 2022 at 10:41:16AM -0500, Bayer, Samuel wrote: >>> I apologize for not being able to be more specific. > >> I know it is hard to quantify. Is it possible that Postgres is treating >> all the terms equally, while Solr is prioritizing terms that are earlier >> in the document? > > A few basic questions: > > * which ranking function are you using? > > * with what options? > > * which PG version exactly? > > As far as I can see from a quick look at the docs, neither > ts_rank() nor ts_rank_cd() consider "earlier in the document" > to be an interesting consideration. They do have the ability > to prefer terms that have been marked as having a higher weight, > but you'd need to do some setup work to make that useful --- > basically, you'd have to separate out the title or other metadata > and apply setweight() to it while building the tsvectors. > > I wouldn't be surprised if Solr has some well-tuned default > heuristics that mean that you don't have to work hard to get > good results from it. The current state of our FTS features > is more like "here's all the parts, but you have to build the > behavior you want". > > ISTM that our FTS features have basically been on autopilot > since they went in. I'd sort of hoped that we'd see more > parsers, more ranking functions, etc, over time ... but nothing > like that has happened. I'm not sure if that's just lack of > interest, or if people find the code too difficult to work with. > > regards, tom lane
Re: [EXT] Re: Looking for tips on improving full-text search quality in Postgres
From
Tom Lane
Date:
"Bayer, Samuel" <sam@mitre.org> writes: > One concrete question, I suppose, is: the classic TF/IDF search strategy relies on inverse document frequency, which looksacross the corpus. I can't tell whether that corpus-wide frequency information is taken into account in either rankingfunction. The documentation is pretty clear that they don't, they just consider each document in isolation. Building a structure that would allow more-global info to be taken into account is an interesting project that nobody's tackled. regards, tom lane
Re: [EXT] Re: Looking for tips on improving full-text search quality in Postgres
From
Bruce Momjian
Date:
On Fri, Mar 4, 2022 at 11:39:39AM -0500, Bayer, Samuel wrote: > I've tried both ranking functions. I've tried a variety of the > normalization settings. I'm using the standard English language > configuration. Postgres 13. > > I do understand your FTS philosophy - I suppose I'm looking for > guidance about how best to approximate the search capability in Solr > using the FTS pieces you have. One concrete question, I suppose, > is: the classic TF/IDF search strategy relies on inverse document > frequency, which looks across the corpus. I can't tell whether that > corpus-wide frequency information is taken into account in either > ranking function. > > I don't know if Solr weights earlier tokens more heavily, but I > wouldn't be surprised if it does. Sorry, I don't know what corpus is or what you are asking? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: [EXT] Re: Looking for tips on improving full-text search quality in Postgres
From
Bruce Momjian
Date:
On Fri, Mar 4, 2022 at 11:43:57AM -0500, Tom Lane wrote: > "Bayer, Samuel" <sam@mitre.org> writes: > > One concrete question, I suppose, is: the classic TF/IDF search strategy relies on inverse document frequency, whichlooks across the corpus. I can't tell whether that corpus-wide frequency information is taken into account in eitherranking function. > > The documentation is pretty clear that they don't, they just consider each > document in isolation. Building a structure that would allow more-global > info to be taken into account is an interesting project that nobody's > tackled. Oh, you mean like, "pizza" is used only in this document vs it being used in every document. Should we add some documentation about this or is it already clear? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: [EXT] Re: Looking for tips on improving full-text search quality in Postgres
From
Atri Sharma
Date:
TF/IDF should be pretty simple to implement IMO.
And no, Solr does not give preference to prior documents.
However, Solr allows you to "boost" specific terms, thus creating the impression of preference.
On Fri, 4 Mar 2022, 22:15 Bruce Momjian, <bruce@momjian.us> wrote:
On Fri, Mar 4, 2022 at 11:43:57AM -0500, Tom Lane wrote:
> "Bayer, Samuel" <sam@mitre.org> writes:
> > One concrete question, I suppose, is: the classic TF/IDF search strategy relies on inverse document frequency, which looks across the corpus. I can't tell whether that corpus-wide frequency information is taken into account in either ranking function.
>
> The documentation is pretty clear that they don't, they just consider each
> document in isolation. Building a structure that would allow more-global
> info to be taken into account is an interesting project that nobody's
> tackled.
Oh, you mean like, "pizza" is used only in this document vs it being
used in every document. Should we add some documentation about this or
is it already clear?
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
If only the physical world exists, free will is an illusion.
Re: [EXT] Re: Looking for tips on improving full-text search quality in Postgres
From
Bruce Momjian
Date:
On Fri, Mar 4, 2022 at 10:22:11PM +0530, Atri Sharma wrote: > TF/IDF should be pretty simple to implement IMO. > > And no, Solr does not give preference to prior documents. > > However, Solr allows you to "boost" specific terms, thus creating the > impression of preference. Postgres can do that too. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.