Thread:
[I hope job postings are kosher...]

I need help optimizing a PostgreSQL application:

Full-text search
~17,000 records
Articles (text) are about 10K long on average, ranging from 0 to 278K.

I don't know if we need to throw more RAM, more hard drive, more
comparison RAM in postmaster.conf or build a concordance or if this is
just not something that can be done within our budget.

I can't even seem to get the PostgreSQL profiling output using "-s" in the
startup of postmaster and client to determine what the db engine is doing.

I don't understand why PostgreSQL sometimes chooses not to use the
existing INDEXes to do an index scan instead of sequential scan -- Does it
really think sequential will be faster, or does it eliminate an index scan
because there won't be enough hard drive or swap space to do it?

Currently, full text search queries take on the order of 2 minutes to
execute.
We need them to be happening in 5 seconds, if at all possible.

Unfortunately, this needs to happen EARLY THIS WEEK, if at all possible.

Contact me off-list with some idea of price/availability/references if you
are interested in taking on this task.

THANKS!
> I don't understand why PostgreSQL sometimes chooses not to use the
> existing INDEXes to do an index scan instead of sequential scan -- Does it
> really think sequential will be faster, or does it eliminate an index scan

Yes, and it's generally right.

> because there won't be enough hard drive or swap space to do it?

Nope. Simply because of the time it takes to read from the disk. An index
scan makes ~1 read per tuple and sequential scans make one per page (gross
simplification).

> Currently, full text search queries take on the order of 2 minutes to
> execute.
> We need them to be happening in 5 seconds, if at all possible.

How about a couple of explains of the queries? What kind of tuning have you
done in postgresql.conf? What's your hardware like? Have you partitioned the
data to separate disks in any way? Are you doing mostly (all?) reads? Some
writes? Perhaps clustering? Is this on 7.2 or 7.3? What is the locale? C or
en_US or something else?

-- 
Rod Taylor <rbt@rbt.ca>

PGP Key: http://www.rbt.ca/rbtpub.asc
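Since the thread never shows the poster's postgresql.conf, here is a purely
illustrative sketch of the memory-related settings Rod is asking about,
assuming a dedicated box with roughly 1 GB of RAM and a 7.2/7.3-era server
(parameter names and units differ in later releases; none of these values
come from the thread):

    # postgresql.conf -- illustrative values only, for a ~1 GB dedicated server
    shared_buffers = 8192          # 8192 x 8 kB pages = 64 MB of shared buffer cache
    sort_mem = 8192                # 8 MB per sort/hash step (later renamed work_mem)
    effective_cache_size = 65536   # ~512 MB, a planner hint about the OS disk cache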
>> I don't understand why PostgreSQL sometimes chooses not to use the
>> existing INDEXes to do an index scan instead of sequential scan --
>> Does it really think sequential will be faster, or does it eliminate
>> an index scan
>
> Yes, and it's generally right.
>
>> because there won't be enough hard drive or swap space to do it?
>
> Nope. Simply because of the time it takes to read from the disk. An index
> scan makes ~1 read per tuple and sequential scans make one per page
> (gross simplification).

Hmmm. An "index" is apparently nothing like I expected it to be... Here I
thought it would be some quick hash-table small data-set lookup with a
reference to the OID -- and that most of the hash tables could just be
loaded in one fell swoop. Oh well.

>> Currently, full text search queries take on the order of 2 minutes to
>> execute.
>> We need them to be happening in 5 seconds, if at all possible.
>
> How about a couple of explains of the queries?

Explains were posted previously, but I'll do a couple more.

At its simplest, this takes 30 seconds:

explain select article.* from article where lower(text) like '%einstein%';
NOTICE: QUERY PLAN:

Seq Scan on article (cost=0.00..1155.01 rows=1 width=216)

Or, slightly more complex:

explain SELECT DISTINCT *, 0 + (0
    + 10 * (lower(title) like '%einstein%') ::int
    + 10 * (lower(author_flattened) like '%einstein%') ::int
    + 30 * (lower(subject_flattened) like '%einstein%') ::int
    + 30 * (lower(text) LIKE '%einstein%') ::int
    +  9 * (substring(lower(title), 1, 20) like '%einstein%') ::int
    + 25 * (substring(lower(text), 1, 20) LIKE '%einstein%') ::int
    ) AS points
  FROM article
  WHERE TRUE
    AND (FALSE
      OR (lower(title) like '%einstein%')
      OR (lower(author_flattened) like '%einstein%')
      OR (lower(subject_flattened) like '%einstein%')
      OR (lower(text) LIKE '%einstein%')
    )
  ORDER BY points desc, volume, number, article.article
  LIMIT 10, 0;
NOTICE: QUERY PLAN:

Limit (cost=1418.03..1418.08 rows=1 width=216)
  -> Unique (cost=1418.03..1418.08 rows=1 width=216)
       -> Sort (cost=1418.03..1418.03 rows=1 width=216)
            -> Seq Scan on article (cost=0.00..1418.02 rows=1 width=216)

> What kind of tuning have
> you done in postgresql.conf?

None. Never really understood what that one memory setting would affect...
And the rest of the options seemed to be about logging output (which I also
can't seem to crank up to the level of getting query analysis out). I RTFM,
but actually comprehending what was written ... :-^

> What's your hardware like?

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 11
model name      : Intel(R) Pentium(R) III CPU family 1400MHz
stepping        : 1
cpu MHz         : 1406.005
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 2804.94

             total:      used:      free:   shared:  buffers:   cached:
Mem:     921235456  736669696  184565760    749568  75321344 592257024
Swap:   2097143808   15368192 2081775616
MemTotal:       899644 kB
MemFree:        180240 kB
MemShared:         732 kB
Buffers:         73556 kB
Cached:         573896 kB
SwapCached:       4480 kB
Active:         433776 kB
Inact_dirty:    182208 kB
Inact_clean:     36680 kB
Inact_target:   229376 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       899644 kB
LowFree:        180240 kB
SwapTotal:     2047992 kB
SwapFree:      2032984 kB

> Have you
> partitioned the data to separate disks in any way?

No, except when attempting to do the PostgreSQL contrib/fulltextindex we
clustered the _fti table by loading it in word order.

> Are you doing mostly (all?)
> reads? Some writes? Perhaps clustering?

Mostly reads.

Some writes by:
  Admin fixing typos, adding new articles
  Nightly cron jobs to "flatten" large-scale JOINs into text concatenations
  (We could get rid of that and go back to the JOINs, now that we've figured
  out that it's really the full text search that's killing us, not the JOINs)

> Is this on 7.2 or 7.3?

7.1.3

> What is the Locale? C or en_US or something
> else?

AFAIK, we didn't do anything to alter the locale from whatever the default
would be...
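The nightly "flattening" described above can also be expressed as plain SQL.
A minimal sketch, using hypothetical tables article(id, author_flattened) and
article_author(article_id, author_name), since the real schema is not shown
in the thread:

    -- Hypothetical schema; the column names are assumptions, not from the thread.
    -- Concatenate every author of an article into one searchable text column.
    UPDATE article
    SET author_flattened = (
        SELECT string_agg(lower(author_name), ' ')  -- string_agg needs PostgreSQL 9.0+;
                                                     -- 7.x would need a custom aggregate
        FROM article_author aa
        WHERE aa.article_id = article.id
    );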
On Tue, 03.12.2002 at 01:45, typea@l-i-e.com wrote:
> Explains were posted previously, but I'll do a couple more.
>
> At its simplest, this takes 30 seconds:
>
> explain select article.* from article where lower(text) like '%einstein%';
> NOTICE: QUERY PLAN:
>
> Seq Scan on article (cost=0.00..1155.01 rows=1 width=216)

Searches with LIKE use indexes only when the like expression starts with a
string (like 'einstein%'), and even then only if in C locale.

You should check out some real full-text index add-ons, like contrib/tsearch,
or construct your own using your imagination plus contrib/intarray and
contrib/intagg :)

---------------
Hannu
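To illustrate Hannu's point about anchored patterns, here is a minimal sketch
using the article table from the earlier EXPLAINs; the index name is made up,
and the index only helps under the conditions Hannu mentions (anchored
pattern, C locale on 7.x):

    -- Functional index on the lower-cased title.
    CREATE INDEX article_lower_title_idx ON article (lower(title));

    -- Can use the index: the pattern is anchored at the start of the string.
    SELECT * FROM article WHERE lower(title) LIKE 'einstein%';

    -- Cannot use a btree index: the leading wildcard forces a sequential scan,
    -- which is why a real full-text add-on such as contrib/tsearch is the better fix.
    SELECT * FROM article WHERE lower(title) LIKE '%einstein%';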
On Mon, 2002-12-02 at 12:30, typea@l-i-e.com wrote:
> [I hope job postings are kosher...]
>
> I need help optimizing a PostgreSQL application:
>
> Full-text search
> ~17,000 records
> Articles (text) are about 10K long on average, ranging from 0 to 278K.
>
> I don't know if we need to throw more RAM, more hard drive, more
> comparison RAM in postmaster.conf or build a concordance or if this is
> just not something that can be done within our budget.
>
> I can't even seem to get the PostgreSQL profiling output using "-s" in the
> startup of postmaster and client to determine what the db engine is doing.
>
> I don't understand why PostgreSQL sometimes chooses not to use the
> existing INDEXes to do an index scan instead of sequential scan -- Does it
> really think sequential will be faster, or does it eliminate an index scan
> because there won't be enough hard drive or swap space to do it?
>
> Currently, full text search queries take on the order of 2 minutes to
> execute.
> We need them to be happening in 5 seconds, if at all possible.
>
> Unfortunately, this needs to happen EARLY THIS WEEK, if at all possible.
>
> Contact me off-list with some idea of price/availability/references if you
> are interested in taking on this task.

After reading the thread to see that your box has what looks like 1GB RAM,
and firing up bc(1) to see that 17K articles each of which is ~10KB == 166MB,
it seems to this simple mind that given enough buffers, you could suck all of
the articles into the buffers. Thus, no more disk IO, but boy would it burn
up the CPU!

Also, I think that I might write some sort of "book index pre-processor" to
run against each article, to create, for each article, a list of words plus
byte offsets. (Some tweaking would have to occur in order to handle
capitalization vagaries. Probably capitalize all "index words".) (Yes, this
method has the limitation of [sub-]word searches instead of arbitrary string
searches.)

Then, insert all that data into a 3rd table (T_LOOKUP) whose structure is:
    val           TEXT     (primary key)
    article_name  TEXT
    byte_offset   INTEGER

Then, 'EINSTEIN%' queries would go against T_LOOKUP instead of the articles
table.

-- 
+------------------------------------------------------------+
| Ron Johnson, Jr.        mailto:ron.l.johnson@cox.net        |
| Jefferson, LA  USA      http://members.cox.net/ron.l.johnson|
|                                                             |
| "they love our milk and honey, but preach about another     |
|  way of living"                                             |
|    Merle Haggard, "The Fighting Side Of Me"                 |
+------------------------------------------------------------+
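A minimal sketch of the lookup table Ron describes, with the column names he
gives; a plain index is used here instead of a primary key on val, since the
same word will appear in many articles and at many offsets:

    -- Word-lookup table populated by the hypothetical pre-processor.
    CREATE TABLE t_lookup (
        val          TEXT,
        article_name TEXT,
        byte_offset  INTEGER
    );
    CREATE INDEX t_lookup_val_idx ON t_lookup (val);

    -- Anchored search against the small lookup table instead of scanning article text
    -- (the same anchored-pattern / locale caveats as elsewhere in the thread apply).
    SELECT DISTINCT article_name
    FROM t_lookup
    WHERE val LIKE 'EINSTEIN%';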
Hi,

I'm wondering whether there is any limit
for the maximum number of tables in a database.

I'm thinking about separating a table with up to millions of rows into
several tables with the same set of columns to speed up some complex
queries. As the size of the original table is increasing fast, I want
to get it separated once the size grows up to a limit. So there
will be a large amount of tables (having same structure) in a database. Is
there any potential performance problem with this design?

Thanks.

Li Li
On Mon, Dec 02, 2002 at 09:46:43PM -0800, li li wrote:
>
> for the maximum number of tables in a database.

<http://www.ca.postgresql.org/users-lounge/limitations.html>

For practical purposes, probably not.

> to get it separated once the size grows up to a limit. So there
> will be a large amount of tables (having same structure) in a database. Is
> there any potential performance problem with this design?

It depends on what you're going to do. If the idea is to join across
the tables, it'll probably perform worse than just having a large
table. OTOH, if what you're doing is, say, archiving from time to
time, it doesn't seem unreasonable.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
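For the "archiving from time to time" case, the move can be done in a single
transaction. A minimal sketch, assuming a hypothetical big_table with a
created_on timestamp column and an archive table with the same columns (none
of these names appear in the thread):

    BEGIN;
    -- Copy old rows into the archive table, then remove them from the live table.
    INSERT INTO big_table_archive
        SELECT * FROM big_table
        WHERE created_on < now() - interval '1 year';
    DELETE FROM big_table
        WHERE created_on < now() - interval '1 year';
    COMMIT;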
>
> It depends on what you're going to do. If the idea is to join across
> the tables, it'll probably perform worse than just having a large
> table. OTOH, if what you're doing is, say, archiving from time to
> time, it doesn't seem unreasonable.
>

The purpose for this design is to avoid record lookup in a huge table.
I expect to see the query results in, say, one minute, by searching a much
smaller table (not join across multiple tables).

Thanks and regards.

Li Li
On Tuesday 03 Dec 2002 7:49 pm, li li wrote:
> > It depends on what you're going to do. If the idea is to join across
> > the tables, it'll probably perform worse than just having a large
> > table. OTOH, if what you're doing is, say, archiving from time to
> > time, it doesn't seem unreasonable.
>
> The purpose for this design is to avoid record lookup in a huge table.
> I expect to see the query results in, say, one minute, by searching a much
> smaller table (not join across multiple tables).
>
> Thanks and regards.

If you only want *most* queries to finish in one minute - I've used two
tables in the past: one for recent info (which is what most of my users
wanted) and one for older info (which only got accessed rarely). You're only
union-ing two tables then, and you can cluster the older table as mentioned
elsewhere.

-- 
Richard Huxton
Archonet Ltd
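A minimal sketch of the two-table split Richard describes, with hypothetical
table and column names (records_recent, records_old, customer_id):

    -- Both tables have the same set of columns; the view covers the rare
    -- "search everything" case.
    CREATE VIEW records_all AS
        SELECT * FROM records_recent
        UNION ALL
        SELECT * FROM records_old;

    -- Most queries touch only the small, recent table:
    SELECT * FROM records_recent WHERE customer_id = 42;

    -- The occasional full search unions both:
    SELECT * FROM records_all WHERE customer_id = 42;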
On Wed, 2002-12-04 at 09:29, Richard Huxton wrote:
> On Tuesday 03 Dec 2002 7:49 pm, li li wrote:
> > > It depends on what you're going to do. If the idea is to join across
> > > the tables, it'll probably perform worse than just having a large
> > > table. OTOH, if what you're doing is, say, archiving from time to
> > > time, it doesn't seem unreasonable.
> >
> > The purpose for this design is to avoid record lookup in a huge table.
> > I expect to see the query results in, say, one minute, by searching a much
> > smaller table (not join across multiple tables).
> >
> > Thanks and regards.
>
> If you only want *most* queries to finish in one minute - I've used two
> tables in the past: one for recent info (which is what most of my users
> wanted) and one for older info (which only got accessed rarely). You're only
> union-ing two tables then, and you can cluster the older table as mentioned
> elsewhere.

Another approach could be to have an index on the timestamp field (which
should be naturally clustered) and search in recent data only.

If the problem is simply too much data returned, you could use LIMIT.

-- 
Hannu Krosing <hannu@tm.ee>
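And a sketch of Hannu's alternative, assuming a hypothetical ts timestamp
column on the big table: index it, restrict queries to the recent slice, and
cap the result size with LIMIT:

    -- On naturally time-ordered data this index stays well clustered.
    CREATE INDEX big_table_ts_idx ON big_table (ts);

    -- Search only the recent slice, and cap the number of rows returned.
    SELECT *
    FROM big_table
    WHERE ts >= now() - interval '30 days'
    ORDER BY ts DESC
    LIMIT 100;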