Thread: VARCHAR -vs- CHAR: huge performance difference?
Hey folks,

I am working on a rather small, simple database. I'm running 7.3.5 on an OpenBSD 3.5-STABLE machine (1.3GHz, 512M RAM, IDE drive using UDMA-5).

I am parsing and storing emails, so I have a lot of character string data to worry about. In one particular table, I have 26 columns of type varchar, with widths of 24-8192 characters. This is the table that is giving me some pretty severe performance problems.

For example, many of the statistics I run against these emails are simple count()s, whether it be on a column or *. Right now, counting the rows takes in excess of 3.5 seconds. OK, that's pretty slow. However, there are only --> 24,000 <-- rows in this table! I could see it taking a long time if I had 500,000 rows, or a million, but 24,000?

Now, I am not an expert with database design whatsoever, so I fully admit that having a table with 26 varchar columns (along with a handful of other fixed-width columns) probably isn't the brightest design. :)  So, I went ahead and created an exact copy of this table, with the exception of creating all character columns as type char(), not varchar(). I was pondering if making PostgreSQL worry about the varying lengths by using varchar was the problem...

And sure enough, counting the rows on the new table takes around 148ms. That's a pretty big difference from 3600ms. And no, 3.6 seconds doesn't sound like much, until you have several operations on the same table to draw a single web page, and suddenly the page takes 20s to load. :(

So, my question for the gurus: is using varchars THAT detrimental? Or am I just going way way overboard by having so _many_ in a single table? Or am I an idiot? (honest question - I'm armpit-deep in learning mode here)

I'm also curious to know if I've crossed some invisible line with the number of columns/width of rows that makes performance degrade rapidly.

If further info is needed, please ask - I just didn't want to spam the list with further table definitions and explain analyze output if it wasn't needed.

Benny

--
"Oh, the Jedis are going to feel this one!" -- Professor Farnsworth, "Futurama"
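A minimal sketch of the kind of timing comparison being described here, assuming the original table is called emails and the char() copy is called emails_char (both names are illustrative, not from the thread):

-- EXPLAIN ANALYZE reports the actual execution time of the seq scan:
EXPLAIN ANALYZE SELECT count(*) FROM emails;
EXPLAIN ANALYZE SELECT count(*) FROM emails_char;

-- Rough on-disk size comparison (relpages is 8kB pages, as of the
-- last VACUUM or ANALYZE on each table):
SELECT relname, relpages
FROM pg_class
WHERE relname IN ('emails', 'emails_char');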
On Tue, 2004-06-15 at 18:58, C. Bensend wrote:
> Hey folks,
>
> I am working on a rather small, simple database. I'm running 7.3.5 on
> an OpenBSD 3.5-STABLE machine (1.3GHz, 512M RAM, IDE drive using UDMA-5).
>
> I am parsing and storing emails, so I have a lot of character string
> data to worry about. In one particular table, I have 26 columns of type
> varchar, with widths of 24-8192 characters. This is the table that is
> giving me some pretty severe performance problems.
>
> For example, many of the statistics I run against these emails are
> simple count()s, whether it be on a column or *. Right now, counting the
> rows takes in excess of 3.5 seconds. OK, that's pretty slow. However,
> there are only --> 24,000 <-- rows in this table! I could see it taking
> a long time if I had 500,000 rows, or a million, but 24,000?
>
> Now, I am not an expert with database design whatsoever, so I fully
> admit that having a table with 26 varchar columns (along with a handful
> of other fixed-width columns) probably isn't the brightest design. :)
> So, I went ahead and created an exact copy of this table, with the
> exception of creating all character columns as type char(), not varchar().
> I was pondering if making PostgreSQL worry about the varying lengths
> by using varchar was the problem...
>
> And sure enough, counting the rows on the new table takes around
> 148ms. That's a pretty big difference from 3600ms. And no, 3.6 seconds
> doesn't sound like much, until you have several operations on the same
> table to draw a single web page, and suddenly the page takes 20s to
> load. :(

Were those fields populated just like the varchar fields? If not, then the test proves little. If so, I find it hard to believe that char(x) would be any faster than varchar. They're all handled about the same.

If you want to do count(*) on the table, do it by having a table with nothing but IDs in it that is referenced by the table with all the text. PostgreSQL can't really optimize aggregate functions with indexes, so it always winds up doing seq scans.
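A sketch of the narrow counting table Scott suggests; the table and column names below are assumptions made for the example, not taken from the thread:

-- Narrow table that holds only the keys; a count here scans far fewer pages.
CREATE TABLE email_ids (
    email_id integer PRIMARY KEY
);

-- The wide table references the narrow one so the two stay in sync
-- (only a couple of the character columns are shown for brevity).
CREATE TABLE emails_wide (
    email_id integer REFERENCES email_ids (email_id),
    subject  varchar(1024),
    body     varchar(8192)
);

-- Row counts now come from the narrow table:
SELECT count(*) FROM email_ids;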
> Were those fields populated just like the varchar fields? If not, then
> the test proves little. If so, I find it hard to believe that char(x)
> would be any faster than varchar. They're all handled about the same.

Hi Scott,

Yes, the new table was populated from the data from the original, via a:

INSERT INTO emails2 SELECT * FROM emails;

This should be correct, yes?

> If you want to do count(*) on the table, do it by having a table with
> nothing but IDs in it that is referenced by the table with all the
> text. PostgreSQL can't really optimize aggregate functions with
> indexes, so it always winds up doing seq scans.

I have also tried doing a count(column) on the emails table, using a column that is indexed. It showed no improvement - I would have expected at least a little gain.

And here's the clincher - when I do a count(*) on a different table with the same number of rows but only four varchar columns, it returns the result in 75ms. (!!!)

Benny

--
"Oh, the Jedis are going to feel this one!" -- Professor Farnsworth, "Futurama"
"C. Bensend" <benny@bennyvision.com> writes: > So, I went ahead and created an exact copy of this table, with the > exception of creating all character columns as type char(), not varchar(). > I was pondering if making PostgreSQL worry about the varying lengths > by using varchar was the problem... The above transformation is a guaranteed loser in Postgres. > And sure enough, counting the rows on the new table takes around > 148ms. That's a pretty big difference from 3600ms. I'm betting that the original table is physically huge because you've not vacuumed it regularly. The copying produced a table with no wasted space, so physically smaller even though the live data is noticeably bigger (because of all the padding blanks you forced to be added). Check what VACUUM VERBOSE has to say about each of these tables... regards, tom lane
> "C. Bensend" <benny@bennyvision.com> writes: >> So, I went ahead and created an exact copy of this table, with the >> exception of creating all character columns as type char(), not >> varchar(). >> I was pondering if making PostgreSQL worry about the varying lengths >> by using varchar was the problem... > > The above transformation is a guaranteed loser in Postgres. Hi Tom, By transformation, do you mean the varchar() -> char() change? If so, I'm not sure I understand - it certainly improved the performance. Or am I misunderstanding? > I'm betting that the original table is physically huge because you've > not vacuumed it regularly. The copying produced a table with no wasted > space, so physically smaller even though the live data is noticeably > bigger (because of all the padding blanks you forced to be added). > > Check what VACUUM VERBOSE has to say about each of these tables... Actually, all databases on this server are vacuumed nightly, right before backups. But here is the data: prod01=> vacuum verbose emails; INFO: --Relation public.emails-- INFO: Index emails_email_id_idx: Pages 358; Tuples 24198: Deleted 82. CPU 0.03s/0.01u sec elapsed 0.41 sec. INFO: Index emails_date_received_idx: Pages 325; Tuples 24198: Deleted 82. CPU 0.00s/0.00u sec elapsed 0.63 sec. INFO: Removed 82 tuples in 23 pages. CPU 0.00s/0.00u sec elapsed 0.06 sec. INFO: Pages 5793: Changed 0, Empty 0; Tup 24198: Vac 82, Keep 0, UnUsed 71757. Total CPU 0.24s/0.06u sec elapsed 4.71 sec. INFO: --Relation pg_toast.pg_toast_399420-- INFO: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0. Total CPU 0.00s/0.00u sec elapsed 0.00 sec. VACUUM prod01=> vacuum verbose emails2; INFO: --Relation public.emails2-- INFO: Pages 2646: Changed 0, Empty 0; Tup 24162: Vac 0, Keep 0, UnUsed 0. Total CPU 0.10s/0.00u sec elapsed 1.00 sec. INFO: --Relation pg_toast.pg_toast_859969-- INFO: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0. Total CPU 0.00s/0.00u sec elapsed 0.00 sec. VACUUM Thanks very much, Benny -- "Oh, the Jedis are going to feel this one!" -- Professor Farnsworth, "Futurama"
"C. Bensend" <benny@bennyvision.com> writes: >> The above transformation is a guaranteed loser in Postgres. > By transformation, do you mean the varchar() -> char() change? Right. The only thing that will do for you is waste space with padding blanks. The performance improvement you saw was due to something else, which I theorize is not having to waste time reading dead space. Since you obviously doubt this assertion, try the same experiment except load the data into a new table with the *same* structure as the original (ie, all varchar). You'll see the same or probably better performance. > Actually, all databases on this server are vacuumed nightly, right > before backups. Not often enough evidently... > INFO: --Relation public.emails-- > INFO: Pages 5793: Changed 0, Empty 0; Tup 24198: Vac 82, Keep 0, UnUsed > 71757. > INFO: --Relation public.emails2-- > INFO: Pages 2646: Changed 0, Empty 0; Tup 24162: Vac 0, Keep 0, UnUsed 0. Note the difference in "pages". emails is nearly double the physical size, even though the live data in it is doubtless smaller. (You could use the contrib/pgstattuple functions to see exactly how much live data there is.) I was actually expecting to see more than a 2:1 difference in file size, seeing that you reported more than a 2:1 difference in read time. It could be that there is also some question of physical layout of the file on disk. The original table probably accreted over time and was given space that's not very consecutive on disk. The copied table was written in one swoop, more or less, and is very likely stored in a more nearly consecutive set of physical disk blocks. I'm not sure how you'd investigate this theory though --- AFAIK there isn't any real easy way to find out how badly fragmented a file is in most Unix filesystems. regards, tom lane
> Right. The only thing that will do for you is waste space with padding
> blanks. The performance improvement you saw was due to something else,
> which I theorize is not having to waste time reading dead space.
>
> Since you obviously doubt this assertion, try the same experiment except
> load the data into a new table with the *same* structure as the original
> (ie, all varchar). You'll see the same or probably better performance.

Hi Tom,

I don't doubt your assertion, I just don't understand it all yet. :)  Let me assure you, you're the expert here, not I.

I did as you suggested - I created a third copy of the table, using the exact same structure. And you're absolutely right - it was lightning fast (around 100ms).

>> Actually, all databases on this server are vacuumed nightly, right
>> before backups.
>
> Not often enough evidently...

This statement worries me a bit. The data in the original table was bulk-loaded the other night (less than three days I'm sure), and new entries have been added at the approximate rate of 300 per day. Is this going to continue to happen?

Or do I just need to vacuum more often? I _did_ try a vacuum before asking the list for help, but it didn't give any improvement (just a vacuum analyze).

> Note the difference in "pages". emails is nearly double the physical
> size, even though the live data in it is doubtless smaller. (You could
> use the contrib/pgstattuple functions to see exactly how much live data
> there is.)

OK. I see (and understand) the pages value now.

> I was actually expecting to see more than a 2:1 difference in file size,
> seeing that you reported more than a 2:1 difference in read time. It
> could be that there is also some question of physical layout of the file
> on disk. The original table probably accreted over time and was given
> space that's not very consecutive on disk. The copied table was
> written in one swoop, more or less, and is very likely stored in a more
> nearly consecutive set of physical disk blocks. I'm not sure how you'd
> investigate this theory though --- AFAIK there isn't any real easy way
> to find out how badly fragmented a file is in most Unix filesystems.

Ugh. So, what would you recommend as a fix? I see the problem, and I see the fix that just worked, but I certainly can't be the only person around that is using a "wide" table with a lot of character data being added at a rather slow rate...

Thank you _very much_ for all your help,

Benny

--
"Oh, the Jedis are going to feel this one!" -- Professor Farnsworth, "Futurama"
Hi Benny,

What happens if you recreate the table using varchar? I mean, use the same procedure you used to create the table with char, but leave the definitions as varchar.

Personally, I can't see any logical reason for char being faster. The problem is the size of the row: each row spans multiple database pages, and all of them have to be read to do a count. It's a lot of IO, and varchar should actually be more compact in your case. I would think that the speedup you see is a result of the table having a fresh new organization.

Anyway, I'm interested to hear what happens if you do the test above.

I liked Scott's suggestion of using a small table containing only the IDs to use for counting. A serial scan of this small table could be expected to be much faster. Postgres doesn't use an index in the case you specified, as I understand it.

Regards
Iain

----- Original Message -----
From: "C. Bensend" <benny@bennyvision.com>
To: <pgsql-admin@postgresql.org>
Sent: Wednesday, June 16, 2004 11:12 AM
Subject: Re: [ADMIN] VARCHAR -vs- CHAR: huge performance difference?

> > "C. Bensend" <benny@bennyvision.com> writes:
> >> So, I went ahead and created an exact copy of this table, with the
> >> exception of creating all character columns as type char(), not
> >> varchar().
> >> I was pondering if making PostgreSQL worry about the varying lengths
> >> by using varchar was the problem...
> >
> > The above transformation is a guaranteed loser in Postgres.
>
> Hi Tom,
>
> By transformation, do you mean the varchar() -> char() change? If
> so, I'm not sure I understand - it certainly improved the performance.
> Or am I misunderstanding?
>
> > I'm betting that the original table is physically huge because you've
> > not vacuumed it regularly. The copying produced a table with no wasted
> > space, so physically smaller even though the live data is noticeably
> > bigger (because of all the padding blanks you forced to be added).
> >
> > Check what VACUUM VERBOSE has to say about each of these tables...
>
> Actually, all databases on this server are vacuumed nightly, right
> before backups. But here is the data:
>
> prod01=> vacuum verbose emails;
> INFO:  --Relation public.emails--
> INFO:  Index emails_email_id_idx: Pages 358; Tuples 24198: Deleted 82.
>         CPU 0.03s/0.01u sec elapsed 0.41 sec.
> INFO:  Index emails_date_received_idx: Pages 325; Tuples 24198: Deleted 82.
>         CPU 0.00s/0.00u sec elapsed 0.63 sec.
> INFO:  Removed 82 tuples in 23 pages.
>         CPU 0.00s/0.00u sec elapsed 0.06 sec.
> INFO:  Pages 5793: Changed 0, Empty 0; Tup 24198: Vac 82, Keep 0, UnUsed 71757.
>         Total CPU 0.24s/0.06u sec elapsed 4.71 sec.
> INFO:  --Relation pg_toast.pg_toast_399420--
> INFO:  Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
>         Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
> VACUUM
> prod01=> vacuum verbose emails2;
> INFO:  --Relation public.emails2--
> INFO:  Pages 2646: Changed 0, Empty 0; Tup 24162: Vac 0, Keep 0, UnUsed 0.
>         Total CPU 0.10s/0.00u sec elapsed 1.00 sec.
> INFO:  --Relation pg_toast.pg_toast_859969--
> INFO:  Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
>         Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
> VACUUM
>
> Thanks very much,
>
> Benny
>
> --
> "Oh, the Jedis are going to feel this one!" -- Professor Farnsworth,
> "Futurama"
> Or do I just need to vacuum more often? I _did_ try a vacuum before
> asking the list for help, but it didn't give any improvement (just a
> vacuum analyze).

"vacuum analyse" allows dead space to be re-used, but doesn't compact the table; you need "vacuum full analyse" for that.

I'm not sure how the free space map relates to this, maybe someone can shed some light, but if it is too small, not all dead rows can be re-used (as I understand it). If your table (or even the DB in general) is updated a lot, the default FSM size that comes with 7.3 may not be enough to ensure the most efficient space reclamation. Last time I looked the docs were a little hazy on that.
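The commands under discussion, as a sketch; the FSM values shown are illustrative assumptions, not recommendations from the thread:

-- Rewrite the table compactly and reclaim dead space (takes an exclusive lock):
VACUUM FULL VERBOSE ANALYZE emails;

-- Check the current free space map settings:
SHOW max_fsm_pages;
SHOW max_fsm_relations;

-- Both are set in postgresql.conf and need a postmaster restart to change, e.g.:
--   max_fsm_pages     = 100000    -- illustrative value
--   max_fsm_relations = 1000      -- illustrative value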
On Tue, 2004-06-15 at 21:09, C. Bensend wrote:
> > Right. The only thing that will do for you is waste space with padding
> > blanks. The performance improvement you saw was due to something else,
> > which I theorize is not having to waste time reading dead space.
> >
> > Since you obviously doubt this assertion, try the same experiment except
> > load the data into a new table with the *same* structure as the original
> > (ie, all varchar). You'll see the same or probably better performance.
>
> Hi Tom,
>
> I don't doubt your assertion, I just don't understand it all yet. :)
> Let me assure you, you're the expert here, not I.
>
> I did as you suggested - I created a third copy of the table, using
> the exact same structure. And you're absolutely right - it was lightning
> fast (around 100ms).
>
> >> Actually, all databases on this server are vacuumed nightly, right
> >> before backups.
> >
> > Not often enough evidently...
>
> This statement worries me a bit. The data in the original table was
> bulk-loaded the other night (less than three days I'm sure), and new
> entries have been added at the approximate rate of 300 per day. Is this
> going to continue to happen?
>
> Or do I just need to vacuum more often? I _did_ try a vacuum before
> asking the list for help, but it didn't give any improvement (just a
> vacuum analyze).
>
> > Note the difference in "pages". emails is nearly double the physical
> > size, even though the live data in it is doubtless smaller. (You could
> > use the contrib/pgstattuple functions to see exactly how much live data
> > there is.)
>
> OK. I see (and understand) the pages value now.
>
> > I was actually expecting to see more than a 2:1 difference in file size,
> > seeing that you reported more than a 2:1 difference in read time. It
> > could be that there is also some question of physical layout of the file
> > on disk. The original table probably accreted over time and was given
> > space that's not very consecutive on disk. The copied table was
> > written in one swoop, more or less, and is very likely stored in a more
> > nearly consecutive set of physical disk blocks. I'm not sure how you'd
> > investigate this theory though --- AFAIK there isn't any real easy way
> > to find out how badly fragmented a file is in most Unix filesystems.
>
> Ugh. So, what would you recommend as a fix? I see the problem, and I
> see the fix that just worked, but I certainly can't be the only person
> around that is using a "wide" table with a lot of character data being
> added at a rather slow rate...

You might want to look into the autovacuum daemon, and / or increasing fsm settings to be large enough to hold all the spare tuples released by vacuuming.
C. Bensend wrote:

> I am parsing and storing emails, so I have a lot of character string
> data to worry about. In one particular table, I have 26 columns of type
> varchar, with widths of 24-8192 characters. This is the table that is
> giving me some pretty severe performance problems.
>
> For example, many of the statistics I run against these emails are
> simple count()s, whether it be on a column or *. Right now, counting the
> rows takes in excess of 3.5 seconds. OK, that's pretty slow. However,
> there are only --> 24,000 <-- rows in this table! I could see it taking
> a long time if I had 500,000 rows, or a million, but 24,000?

For a similar post, someone suggested doing the count like:

select count(*) from table where any = any limit 1;

Try this and see if there's any performance increase.

--
Regards,
Vishal Kashyap
Director / Lead Software Developer,
Sai Hertz And Control Systems Pvt Ltd,
http://saihertz.rediffblogs.com
Yahoo IM: mailforvishal[ a t ]yahoo.com
> You might want to look into the autovacuum daemon, and / or increasing
> fsm settings to be large enough to hold all the spare tuples released by
> vacuuming.

IIRC, the autovacuum stuff was added in 7.4, which I'm not running (yet). An upgrade to 7.4.3 might be prudent for me, while the database is still small.

Benny

--
"Oh, the Jedis are going to feel this one!" -- Professor Farnsworth, "Futurama"
On Wed, 16 Jun 2004 09:37:58 -0500 (CDT), C. Bensend <benny@bennyvision.com> wrote:

>> You might want to look into the autovacuum daemon, and / or increasing
>> fsm settings to be large enough to hold all the spare tuples released by
>> vacuuming.
>
> IIRC, the autovacuum stuff was added in 7.4, which I'm not running (yet).
> An upgrade to 7.4.3 might be prudent for me, while the database is still
> small.
>
> Benny

By the way, be careful. The pg_autovacuum daemon is not good enough to be used as the only way of cleaning the database, especially if you have a lot of updates and deletes on the same tables.

--
Downloading signature ... 99% *CRC FAILED* signature aborted
> By the way, be careful. The pg_autovacuum daemon is not good enough to be
> used as the only way of cleaning the database, especially if you have a
> lot of updates and deletes on the same tables.

For right now, I'm just vacuuming more often to see how things go. I didn't think I was even generating enough transactions for it to be a problem, but a 'vacuum full' made a night-and-day difference.

If it continues to be problematic, I'll probably look at an upgrade.

Benny

--
"Oh, the Jedis are going to feel this one!" -- Professor Farnsworth, "Futurama"
On Thu, 2004-06-17 at 00:46, Olivier Hubaut wrote:
> On Wed, 16 Jun 2004 09:37:58 -0500 (CDT), C. Bensend
> <benny@bennyvision.com> wrote:
>
> >> You might want to look into the autovacuum daemon, and / or increasing
> >> fsm settings to be large enough to hold all the spare tuples released by
> >> vacuuming.
> >
> > IIRC, the autovacuum stuff was added in 7.4, which I'm not running (yet).
> > An upgrade to 7.4.3 might be prudent for me, while the database is still
> > small.
> >
> > Benny
>
> By the way, be careful. The pg_autovacuum daemon is not good enough to be
> used as the only way of cleaning the database, especially if you have a
> lot of updates and deletes on the same tables.

The problem here isn't pg_autovacuum, but too small of settings for fsm. I've run multi-day tests where autovacuum kept the size of the database pretty much the same with 200+ updates a second going on.
> The problem here isn't pg_autovacuum, but too small of settings for
> fsm. I've run multi-day tests where autovacuum kept the size of the
> database pretty much the same with 200+ updates a second going on.

Hi Scott,

Could you explain the fsm a little more? I have done _no_ tuning on my database simply because I don't know how, but I would be interested to hear your recommendations for the fsm settings.

Thanks!

Benny

--
"Oh, the Jedis are going to feel this one!" -- Professor Farnsworth, "Futurama"
On Thu, 2004-06-17 at 09:57, C. Bensend wrote:
> > The problem here isn't pg_autovacuum, but too small of settings for
> > fsm. I've run multi-day tests where autovacuum kept the size of the
> > database pretty much the same with 200+ updates a second going on.
>
> Hi Scott,
>
> Could you explain the fsm a little more? I have done _no_ tuning on
> my database simply because I don't know how, but I would be interested
> to hear your recommendations for the fsm settings.

OK, here's the basics of what fsm does.

As you probably already know, PostgreSQL uses an in-store multi-versioning system (MVCC), where the first version of a tuple points to the next newest version, and so on, until we get to the last version. So, if you update the same row four times, you'll have five copies of it in the data storage files: the original, plus four copies. Each transaction that needs to access said tuple will grab the first one, check the date stamp on it, and go on to the next one, until it finds the last valid one for it, based on the rules of visibility. We won't get into that right now, because what's important is that we have five versions of that tuple in the table.

Now, after all the currently processing transactions finish, all the new transactions are going to be seeing the fifth version of the tuple from then on. But they'll all start by looking at the first one and working forward one at a time.

Along comes vacuum and vacuum full. Vacuum full marks all these dead tuples, then actually compresses the data file back down to reclaim the lost space. Plain (or lazy) vacuums simply mark the rows as free, without actually reclaiming them.

The free space map is what vacuum uses to mark those tuples as free, and what the storage manager checks to find free space to place new tuples in the tables. So, the free space map needs to be big enough to hold a reference to every single freed row from vacuuming, or the vacuumed tuple space will not get reused, and the storage manager will simply append new tuples onto the end of the data file.

You can find a bit more on this subject at:

http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html
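A small, self-contained illustration of the mechanism Scott describes (the table and values are made up for the example):

-- Every UPDATE leaves the previous version of the row behind as a dead tuple.
CREATE TABLE fsm_demo (id integer, payload varchar(100));
INSERT INTO fsm_demo VALUES (1, 'original');

UPDATE fsm_demo SET payload = 'version 2' WHERE id = 1;
UPDATE fsm_demo SET payload = 'version 3' WHERE id = 1;
UPDATE fsm_demo SET payload = 'version 4' WHERE id = 1;

-- A plain VACUUM records the dead tuples' space in the free space map so
-- later inserts and updates can reuse it; the data file itself is not shrunk.
VACUUM VERBOSE fsm_demo;

-- VACUUM FULL rewrites the table and actually gives the space back.
VACUUM FULL VERBOSE fsm_demo;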
There is some information on www.varlena.com; I think it may have been Scott who wrote the article that I'm thinking about. There is other information around too - your best bet may be google.

Regards
Iain

----- Original Message -----
From: "C. Bensend" <benny@bennyvision.com>
To: <pgsql-admin@postgresql.org>
Sent: Friday, June 18, 2004 12:57 AM
Subject: Re: [ADMIN] VARCHAR -vs- CHAR: huge performance difference?

> > The problem here isn't pg_autovacuum, but too small of settings for
> > fsm. I've run multi-day tests where autovacuum kept the size of the
> > database pretty much the same with 200+ updates a second going on.
>
> Hi Scott,
>
> Could you explain the fsm a little more? I have done _no_ tuning on
> my database simply because I don't know how, but I would be interested
> to hear your recommendations for the fsm settings.
>
> Thanks!
>
> Benny
>
> --
> "Oh, the Jedis are going to feel this one!" -- Professor Farnsworth,
> "Futurama"