Thread: 7.0.2 issues / Geocrawler
Well, I took Thomas' advice and upgraded to 7.0.2 from source.tar.gz. For
some reason, I cannot create the following index:

db_geocrawler=# DROP INDEX idx_mail_archive_list_subject;
ERROR: index "idx_mail_archive_list_subject" nonexistent
db_geocrawler=# CREATE INDEX "idx_mail_archive_list_subject" on
"tbl_mail_archive" using btree ( "fld_mail_list" "int4_ops",
"fld_mail_subject" "text_ops" );
ERROR: cannot create idx_mail_archive_list_subject

[root@geocrawler db_geocrawler]# rm -f idx_mail_archive_list_subject

That removes the physical file on disk, so I can then try to create it
again. If I then issue the SQL command, postgres accepts it and it runs
forever, never creating more than an 8192-byte file.

If you watch your process list:

[root@geocrawler db_geocrawler]# ps ax
  PID TTY STAT TIME COMMAND
  457 ?   SW   0:00 [postmaster]
 1419 ?   R    1:34 [postmaster]

Eventually, the psql connection disappears from the process list and I get
strange postmaster processes running (see above). After that, I get this
error from psql:

ERROR: btree: index item size 2820 exceeds maximum 2717

Any way to tell where that item is at?

From the pgserver.log file:

DEBUG: Data Base System is starting up at Tue Jul 11 16:59:33 2000
DEBUG: Data Base System was interrupted being in production at Tue Jul 11 15:47:04 2000
DEBUG: Data Base System is in production state at Tue Jul 11 16:59:33 2000

...which doesn't give me much to go on. I'm really at wits' end - I've
spent over two days trying to rebuild Geocrawler. The next step is
reformatting the hard disk and reinstalling postgres 6.4.2.

Tim

--
Founder - PHPBuilder.com / Geocrawler.com
Lead Developer - SourceForge
VA Linux Systems
408-542-5723
Tim Perdue wrote:
> ...
> After that, I get this error from psql:
>
> ERROR: btree: index item size 2820 exceeds maximum 2717
>
> Any way to tell where that item is at?

I've been wondering at the state of the problems you've been having with
PostgreSQL and wondering why I haven't experienced the same. I think this
may very well be it. Earlier versions of PostgreSQL allowed the creation
of indexes on fields whose length would not permit at least 2 entries per
index page. 95% of the time, things would work fine. But 5% of the time
you would get corrupted data.

Before creating the index:

SELECT * FROM tbl_mail_archive WHERE Length(fld_mail_subject) > 2700;

will get you the list of records which cannot be indexed. You're
attempting to create a multi-key index, so I would truncate (or delete)
any record whose fld_mail_subject is > 2700:

UPDATE tbl_mail_archive
SET fld_mail_subject = SubStr(fld_mail_subject, 1, 2700);

At this point, your index creation should be relatively quick (and
successful), depending upon how many rows you have. I have a few tables
with ~2 million rows that take about 5-10 minutes (with fsync off,
naturally) to index. I would also recommend letting PostgreSQL determine
the correct "ops":

CREATE INDEX idx_mail_archive_list_subject
ON tbl_mail_archive (fld_mail_list, fld_mail_subject);

Without following the lists every day, most people wouldn't know about
this issue. I'm surprised it took so long for PostgreSQL 7.0.2 to bail on
the index creation, though. Is this a particularly large table? At any
rate, this is an example of a bug which *would* allow for the kinds of
corruption you've seen in the past and that has been addressed in 7.0.2,
as Tom Lane crushed them by the hundreds. If you can:

psql db_geocrawler < 6_4dump.txt

and it never bails, then you know all your data is "clean". Until that
point, any index you have on a "text" datatype is subject to similar
problems.

Hope that helps,

Mike Mascari
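A quick way to gauge the scale of the cleanup before touching any data is
to count the offending rows rather than listing them. A minimal sketch,
using the table and column names from the thread and the 2700-byte cutoff
taken from the error message above:

-- How many rows are too long to index? (threshold from the btree error)
SELECT count(*)
FROM tbl_mail_archive
WHERE Length(fld_mail_subject) > 2700;

Chances are that only a handful of messages carry multi-kilobyte subject
lines, so the cleanup writes should be small even on a very large table.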
This is a *big* help.

Yes, the table is approx 10-12GB in size, and running your length() and
update queries is going to take a lifetime, since it will require a
calculation on 4 million rows.

This doesn't address the serious performance problem I'm finding in 7.0.2
for a multi-key select/order by/limit/offset query, which I sent in a
separate email.

Tim

Mike Mascari wrote:
> Before creating the index:
>
> SELECT * FROM tbl_mail_archive WHERE Length(fld_mail_subject) > 2700;
>
> will get you the list of records which cannot be indexed. [...]

--
Founder - PHPBuilder.com / Geocrawler.com
Lead Developer - SourceForge
VA Linux Systems
408-542-5723
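One hedge against the full-table cost Tim is worried about: adding a WHERE
clause to Mike's UPDATE means the scan still computes Length() on every
row, but only the over-long rows are actually rewritten, not all 4
million. A sketch under the same assumptions as above:

-- Truncate only the rows that exceed the btree item limit;
-- all other rows are read but never rewritten.
UPDATE tbl_mail_archive
SET fld_mail_subject = SubStr(fld_mail_subject, 1, 2700)
WHERE Length(fld_mail_subject) > 2700;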
Tim Perdue wrote:
>
> This is a *big* help.
>
> Yes, the table is approx 10-12GB in size, and running your length() and
> update queries is going to take a lifetime, since it will require a
> calculation on 4 million rows.
>
> This doesn't address the serious performance problem I'm finding in
> 7.0.2 for a multi-key select/order by/limit/offset query, which I sent
> in a separate email.
>
> Tim

If I recall correctly, Marc experienced similar performance differences
with UDM search after upgrading. The optimizer was redesigned to be
smarter about using indexes with both ORDER BY and LIMIT. Tom Lane, of
course, knows all there is to know on this. All I can ask is the
standard-issue precursor to optimizer questions:

Have you VACUUM ANALYZE'd the table(s) in question?

If so, hopefully Tom Lane can comment.

Sorry I couldn't be more help,

Mike Mascari
Mike Mascari wrote:
> Have you VACUUM ANALYZE'd the table(s) in question?

Yes, they've been vacuum analyze'd and re-vacuum analyze'd to death. Also
added some extra indexes that I don't really need, just to see if that
helps.

Tim

--
Founder - PHPBuilder.com / Geocrawler.com
Lead Developer - SourceForge
VA Linux Systems
408-542-5723
On Wed, Jul 12, 2000 at 06:17:23AM -0700, Tim Perdue wrote:
> Mike Mascari wrote:
> > Have you VACUUM ANALYZE'd the table(s) in question?
>
> Yes, they've been vacuum analyze'd and re-vacuum analyze'd to death.
> Also added some extra indexes that I don't really need, just to see if
> that helps.

Tim, why are you building a multikey index, especially one containing a
large text field? It's almost never a win to index a text field, unless
all the WHERE clauses that use it are either anchored to the beginning of
the field, or are equality tests (in which case, the field is really an
enumerated type masquerading as a text field).

A multikey index is only useful for a very limited set of queries. Here's
a message from last August where Tom Lane talks about that:

http://www.postgresql.org/mhonarc/pgsql-sql/1999-08/msg00145.html

Ross

--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005
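For illustration, here is the distinction Ross is drawing, sketched
against the subject column from this thread; whether the anchored form can
actually use a btree index also depends on the locale the database was
initialized with (the C locale, in PostgreSQL of this era):

-- Left-anchored pattern: the planner can turn this into a range scan
-- on an index over fld_mail_subject (C locale assumed).
SELECT * FROM tbl_mail_archive WHERE fld_mail_subject LIKE 'Re: %';

-- Unanchored pattern: the index is useless; this is a sequential scan.
SELECT * FROM tbl_mail_archive WHERE fld_mail_subject LIKE '%postgres%';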
"Ross J. Reedstrom" wrote: > Tim, why are you building a multikey index, especially one containing a > large text field? It's almost never a win to index a text field, unless This is not a key on a text field. The keys are: mail_list (example, the PHP mailing list=1) mail_year (1999) mail_month (July=7) Yes it is a multi-key index, and the matches are exact. Someone else asked why I have separated these fields out from the mail_date. If I didn't, and I wanted to see the messages for this month, I'd have to regex and that would overwhelm the database. Tim -- Founder - PHPBuilder.com / Geocrawler.com Lead Developer - SourceForge VA Linux Systems 408-542-5723
"Ross J. Reedstrom" wrote: > > On Wed, Jul 12, 2000 at 06:17:23AM -0700, Tim Perdue wrote: > > Mike Mascari wrote: > > > Have you VACUUM ANALYZE'd the table(s) in question? > > > > Yes, they've been vacuum analyze'd and re-vaccum analyze'd to death. > > Also added some extra indexes that I don't really need just to see if > > that helps. > > Tim, why are you building a multikey index, especially one containing a > large text field? It's almost never a win to index a text field, unless > all the WHERE clauses that use it are either anchored to the beginning > of the field, or are equality tests (in which case, the field is really > an enumerated type, masquerading as a text field) > > A multikey index is only useful for a very limited set of queries. Here's > a message from last August, where Tom Lane talks about that: > > http://www.postgresql.org/mhonarc/pgsql-sql/1999-08/msg00145.html I think Tim had 2 problems. The first was tuples whose text attributes did not permit two on the same index page. The second, however, is that a query against the *same schema* under 6.x now runs slower by a factor of 15 under 7.x: "The following query is at the very heart of the site and it takes upwards of 15-20 seconds to run now. It used to be instantaneous. explain SELECT mailid, mail_date, mail_is_followup, mail_from, mail_subject FROM mail_archive WHERE mail_list=35 AND mail_year=2000AND mail_month=1 ORDER BY mail_date DESC LIMIT 26 OFFSET0; NOTICE: QUERY PLAN: Sort (cost=138.41..138.41 rows=34 width=44) -> Index Scan using idx_mail_archive_list_yr_mo on tbl_mail_archive (cost=0.00..137.55 rows=34 width=44) EXPLAIN" Even though he's using a mult-key index here, it is composed entirely of integer fields. Its reducing to a simple index scan + sort, so I don't see how the performance could drop off so dramatically. Perhaps if we could see the EXPLAIN output with the same query against the 6.x database we could see what's going on. Mike Mascari
On Wed, 12 Jul 2000, Mike Mascari wrote:
> If I recall correctly, Marc experienced similar performance differences
> with UDM search after upgrading. The optimizer was redesigned to be
> smarter about using indexes with both ORDER BY and LIMIT. Tom Lane, of
> course, knows all there is to know on this. All I can ask is the
> standard-issue precursor to optimizer questions:

it was a problem with v7.0 that Tom provided a workaround for, but I'm 99%
certain that the workaround was included in v7.0.1 ...
On Wed, 12 Jul 2000, Tim Perdue wrote:
> Mike Mascari wrote:
> > Have you VACUUM ANALYZE'd the table(s) in question?
>
> Yes, they've been vacuum analyze'd and re-vacuum analyze'd to death.
> Also added some extra indexes that I don't really need, just to see if
> that helps.

what does EXPLAIN <query>; show, and what is the QUERY itself that is so
slow?
On Wed, 12 Jul 2000, Tim Perdue wrote:
> "Ross J. Reedstrom" wrote:
> > Tim, why are you building a multikey index, especially one containing a
> > large text field? It's almost never a win to index a text field, unless
>
> This is not a key on a text field.
>
> The keys are:
>
> mail_list (example: the PHP mailing list = 1)
> mail_year (1999)
> mail_month (July = 7)
>
> Yes, it is a multi-key index, and the matches are exact.
>
> Someone else asked why I have separated these fields out from the
> mail_date.
>
> If I didn't, and I wanted to see the messages for this month, I'd have
> to regex, and that would overwhelm the database.

if you did it as a proper date field, you could use stuff like 'date_part'
and 'date_trunc' to pull out a particular month, year, etc ...
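A sketch of what Marc is describing, assuming the date were stored in a
real date/timestamp column (hypothetically named mail_sent here, since
Tim's schema, as he explains below, actually stores a char(14)):

-- All messages from July 1999, pulled from a proper date column
-- instead of regex-matching text:
SELECT mailid, mail_subject
FROM mail_archive
WHERE date_part('year', mail_sent) = 1999
  AND date_part('month', mail_sent) = 7;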
On Wed, Jul 12, 2000 at 08:14:29AM -0700, Tim Perdue wrote:
> "Ross J. Reedstrom" wrote:
> > Tim, why are you building a multikey index, especially one containing a
> > large text field? It's almost never a win to index a text field, unless
>
> This is not a key on a text field.

Ah, I see, I had merged the two problems you reported together. I see now
that the 'can't create index' problem was on a different index. Mike
Mascari gave you a detailed answer to that, which you seemed to just blow
off, based on your guesstimate that it would run too long:

> This is a *big* help.
>
> Yes, the table is approx 10-12GB in size, and running your length() and
> update queries is going to take a lifetime, since it will require a
> calculation on 4 million rows.

Mike mentioned that he's run similar index creations on 2 million rows,
and it took 5-10 minutes. I reiterate: you've got a long subject that
tripped a bug in index creation in postgresql versions < 7.0. Give his
solution a try. It's a 'clean it up once' sort of thing: I don't think
anyone's going to complain about the subject getting trimmed at ~2K.

> The keys are:
>
> mail_list (example: the PHP mailing list = 1)
> mail_year (1999)
> mail_month (July = 7)
>
> Yes, it is a multi-key index, and the matches are exact.

Right, as your explain output showed: the planner is picking this index
and using it. I'd guess that your time is getting lost in the sort step. I
seem to recall that Tom reworked the sort code as well, to reduce the size
of temporary sort files: perhaps you've found a corner case that is much
slower. Do you still have the 6.X install available? EXPLAIN output from
that would be useful.

> Someone else asked why I have separated these fields out from the
> mail_date.
>
> If I didn't, and I wanted to see the messages for this month, I'd have
> to regex, and that would overwhelm the database.

That's what the date_part function is for:

reedstrm=# select now();
          now
------------------------
 2000-07-12 11:03:11-05
(1 row)

reedstrm=# select date_part('month', now());
 date_part
-----------
         7
(1 row)

reedstrm=# select date_part('year', now());
 date_part
-----------
      2000
(1 row)

So your query would look like:

SELECT mailid, mail_date, mail_is_followup, mail_from, mail_subject
FROM mail_archive
WHERE mail_list=35
  AND date_part('year', mail_date)=2000
  AND date_part('month', mail_date)=1
ORDER BY mail_date DESC LIMIT 26 OFFSET 0;

You can even build functional indices. However, since you're selecting and
sorting based on the same attribute, the time of the message, it should be
possible to build an index on mail_date, and construct a SELECT that uses
it for ordering as well as limiting the tuples returned. You're generating
the queries programmatically, from a scripting language, right? So, the
best thing would be if you could create a query that looks something like:

SELECT mailid, mail_date, mail_is_followup, mail_from, mail_subject
FROM mail_archive
WHERE mail_list=35
  AND mail_date >= 'January 1, 2000'
  AND mail_date < 'February 1, 2000'
ORDER BY mail_date DESC LIMIT 26 OFFSET 0;

With an index on mail_date, that should do a single index scan, returning
the first 26, and stop. I'd bet a lot that it's the sort that's killing
you, since the backend has to retrieve the entire result set and sort it
to be sure it returns the first 26.

You might be able to use a two-key index, on mail_date, mailid. I think
you have to be careful to put key you want sorted output on first, to
ensure that the index order is presorted, and the planner know it.

Ross

--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005
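A minimal sketch of the indexes Ross is suggesting, assuming mail_date is
(or has been converted to) a properly sortable column; the index names
here are invented:

-- Single-column index: supports both the date range restriction and
-- the ORDER BY mail_date DESC ... LIMIT scan.
CREATE INDEX idx_mail_archive_date ON mail_archive (mail_date);

-- Two-key variant, with the sort key first, as Ross recommends:
CREATE INDEX idx_mail_archive_date_id ON mail_archive (mail_date, mailid);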
On Wed, Jul 12, 2000 at 11:36:44AM -0500, Ross J. Reedstrom wrote:
>
> You might be able to use a two-key index, on mail_date, mailid. I think
> you have to be careful to put key you want sorted output on first,
> to ensure that the index order is presorted, and the planner know it.

Bah, I clearly need lunch: that last sentence, with better grammar:

[...] be careful to put the key you want output sorted on first, to ensure
that the index order is presorted, and that the planner knows it.

Ross

--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005
"Ross J. Reedstrom" wrote: > Mike Mascari gave you a detailed answer to that, which you seemd to just blow > off, based on you guesstimate that it would run too long: That is a separate issue - unrelated to this performance issue and it was not "blown" off, I was merely making a comment. > Right, as your explain output showed: the planner is picking this index > and using it. I'd guess that your time is getting lost in the sort step. I think you're probably right. It's hard to imagine that sorting is that much slower, but it's hard to say. Your ideas for selecting based on the date are intriguing, however the schema of the db was not done with that in mind. Everyone thinks I'm a nut when I say this, but the date is stored in a char(14) field in gregorian format: 19990101125959 So perhaps sorting a char(14) field is somehow majorly slower now. No I don't have 6.5.3 installed anymore - it was totally fubar and wasn't running anymore. Tim -- Founder - PHPBuilder.com / Geocrawler.com Lead Developer - SourceForge VA Linux Systems 408-542-5723
On Wed, 12 Jul 2000, Tim Perdue wrote:
> "Ross J. Reedstrom" wrote:
> > Right, as your explain output showed: the planner is picking this index
> > and using it. I'd guess that your time is getting lost in the sort step.
>
> I think you're probably right. It's hard to imagine that sorting is that
> much slower, but it's hard to say.

just curious, but what if you remove the ORDER BY, just to test ... is it
that much faster without than with? Just want to narrow down *if* it's a
sorting issue or not, that's all ...

If it is a sorting issue, what if you raise the -S value?
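Concretely, the test Marc is proposing is the query quoted earlier in the
thread with the sort step removed:

-- Same query as before, minus the ORDER BY, to isolate the sort cost:
SELECT mailid, mail_date, mail_is_followup, mail_from, mail_subject
FROM mail_archive
WHERE mail_list=35 AND mail_year=2000 AND mail_month=1
LIMIT 26 OFFSET 0;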
The Hermit Hacker wrote:
> just curious, but what if you remove the ORDER BY, just to test ... is it
> that much faster without than with? Just want to narrow down *if* it's a
> sorting issue or not, that's all ...

Good call - it was instantaneous, as it used to be.

> If it is a sorting issue, what if you raise the -S value?

-S is 32768 right now.

Tim

--
Founder - PHPBuilder.com / Geocrawler.com
Lead Developer - SourceForge
VA Linux Systems
408-542-5723
On Wed, 12 Jul 2000, Tim Perdue wrote:
> The Hermit Hacker wrote:
> > just curious, but what if you remove the ORDER BY, just to test ... is it
> > that much faster without than with? Just want to narrow down *if* it's a
> > sorting issue or not, that's all ...
>
> Good call - it was instantaneous, as it used to be.

It takes us awhile sometimes, but we eventually clue in :)

> > If it is a sorting issue, what if you raise the -S value?
>
> -S is 32768 right now.

how many results come back from the query? ignoring the LIMIT, that is ...
see, it has to ORDER BY before the LIMIT, of course ...
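Marc's question can be answered directly with a count over the same WHERE
clause; since the backend must sort the whole result set before applying
the LIMIT, this number is exactly how many rows the sort has to handle:

-- How many rows does the sort step actually have to order?
SELECT count(*)
FROM mail_archive
WHERE mail_list=35 AND mail_year=2000 AND mail_month=1;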
Tim Perdue <tperdue@valinux.com> writes:
> The Hermit Hacker wrote:
>> just curious, but what if you remove the ORDER BY, just to test ... is it
>> that much faster without than with? Just want to narrow down *if* it's a
>> sorting issue or not, that's all ...

> Good call - it was instantaneous, as it used to be.

How much data is getting passed through the sort step --- you might need
to raise the query LIMIT to find out...

			regards, tom lane