Thread: Table partitioning for maximum speed?
I'm sure this is a concept that's been explored here. I have a table (fairly
simple, just two columns, one of which is a 32-digit checksum) with several
million rows (currently, about 7 million). About a million times a day we do

  select * from my_table where md5 = ?

to verify presence or absence of the row, and base further processing on that
information.

The idea bandied about now is to partition this table into 16 (or 256, or ...)
chunks by first digit (or 2, or ...). In the simplest case, this would mean:

  create table my_table_0 as select * from my_table where md5 like '0%';
  create table my_table_1 as select * from my_table where md5 like '1%';
  ...
  create table my_table_f as select * from my_table where md5 like 'f%';

Then change the code to examine the checksum and create a query to the
appropriate table based on the first digit.

Obviously, this is conceptually similar to what the index on the "md5" column
is supposed to do for us. However, partitioning moves just a little of the
processing load off the database server and onto the machine running the
application. That's important, because we can afford more application machines
as load increases, but we can't as easily upgrade the database server.

Will a query against a table of 0.5 million rows beat a query against a table
of 7 million rows by a margin that makes it worth the hassle of supporting 15
"extra" tables?

--
Jeff Boes                vox 269.226.9550 ext 24
Database Engineer        fax 269.349.9076
Nexcerpt, Inc.           http://www.nexcerpt.com
...Nexcerpt... Extend your Expertise
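For illustration, a minimal sketch of the rest of that layout (index names are
made up): CREATE TABLE ... AS does not copy indexes, so each partition would
still need its own index on the checksum for the lookup to stay cheap, and the
application would issue the query against the table matching the first hex
digit of the value it computed.

  -- Sketch only: one index per partition, through my_table_f
  create index ix_my_table_0_md5 on my_table_0 (md5);
  create index ix_my_table_1_md5 on my_table_1 (md5);

  -- Application-routed query, e.g. for a checksum beginning with '0':
  select * from my_table_0 where md5 = ?;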
On Thu, Oct 09, 2003 at 18:37:19 +0000, Jeff Boes <jboes@nexcerpt.com> wrote:
> I'm sure this is a concept that's been explored here. I have a table
> (fairly simple, just two columns, one of which is a 32-digit checksum)
> with several million rows (currently, about 7 million). About a million
> times a day we do
>
>   select * from my_table where md5 = ?
>
> to verify presence or absence of the row, and base further processing on
> that information.
>
> The idea bandied about now is to partition this table into 16 (or 256,
> or ...) chunks by first digit (or 2, or ...). In the simplest case, this
> would mean:

If there is an index on the checksum column, then you shouldn't get much of a
speed up by partitioning the data. If you don't have an index on the checksum,
it sounds like you should.
> That's important, because we can afford more application
> machines as load increases, but we can't as easily
> upgrade the database server.

Two ideas come to mind:

One way to speed things up is to convert the entire checksum. Consider what an
md5 checksum really is: a text string representing a hexadecimal number.
Storing it as TEXT or CHAR is not as good as storing it as a number directly.
Have your application convert it to a decimal number, and then store the
checksum as type NUMERIC in the database. This gives an immediate speed boost.

Next, use partial indexes to speed things up even further. How many partial
indexes you want to create depends on your ratio of selects to updates, and
how important each is to you. Some quick statistical analysis I did showed
that for 10 indexes, the magic number is somewhere around 3.402 x 10^37. In
other words:

  CREATE TABLE md5check (id SERIAL, md5 NUMERIC);

  CREATE INDEX md5_i0 ON md5check (md5) WHERE
    md5 <= 34000000000000000000000000000000000000;
  CREATE INDEX md5_i1 ON md5check (md5) WHERE
    md5 > 34000000000000000000000000000000000000 AND
    md5 <= 68000000000000000000000000000000000000;
  CREATE INDEX md5_i2 ON md5check (md5) WHERE
    md5 > 68000000000000000000000000000000000000 AND
    md5 <= 102000000000000000000000000000000000000;
  ...
  CREATE INDEX md5_i10 ON md5check (md5) WHERE
    md5 > 340000000000000000000000000000000000000;

On my test table with 1/2 million rows, I saw a speed up from .16 msec (using
TEXT only) to .09 msec. The more partial indexes you create, the faster things
will go. Just remember to put the upper and lower boundary indexes in place to
catch everything.

Aside: if you are merely testing for the existence of the row, you can pull
back a constant instead of the whole row:

  SELECT 1 FROM md5check WHERE md5 = ?

Another way to speed things up is to break the checksum up into parts so that
we can use one of the "normal" datatypes: specifically, BIGINT. Divide the
32-character checksum into four pieces, convert each piece to a decimal
number, and store each in its own BIGINT column. The good news with this way
is that you only need an index on one of the columns. Even at 7 million plus,
the number of matches on 1/4 of the checksum characters is small enough to not
need additional indexes.

  CREATE TABLE md5check (id SERIAL, md1 BIGINT, md2 BIGINT, md3 BIGINT, md4 BIGINT);
  CREATE INDEX md5_i1 ON md5check(md1);

You can also add partial indexes to this as well, for maximum speed.

--
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200310101135
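A sketch of how the existence test might look against that four-BIGINT layout,
assuming the application slices the 32-character checksum into four
8-character chunks and converts each from hex to an integer before binding:

  -- Sketch only: each 8-hex-digit chunk is at most 4294967295, so it fits
  -- comfortably in a BIGINT with no sign handling. Only md1 is indexed; the
  -- other comparisons just filter the handful of rows sharing that chunk.
  SELECT 1 FROM md5check
   WHERE md1 = ? AND md2 = ? AND md3 = ? AND md4 = ?
   LIMIT 1;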
Please keep discussions on the list so that others may learn from or comment
on the suggested solutions.

On Fri, Oct 10, 2003 at 11:27:50 -0400, Jeff Boes <jboes@nexcerpt.com> wrote:
> Bruno Wolff III wrote:
> > On Thu, Oct 09, 2003 at 18:37:19 +0000, Jeff Boes <jboes@nexcerpt.com> wrote:
> >
> > > The idea bandied about now is to partition this table into 16 (or 256,
> > > or ...) chunks by first digit (or 2, or ...). In the simplest case, this
> > > would mean:
> >
> > If there is an index on the checksum column, then you shouldn't get
> > much of a speed up by partitioning the data.
> > If you don't have an index on the checksum, it sounds like you should.
>
> Yes, the table has:
>
>      Table "public.link_checksums"
>  Column  |     Type      | Modifiers
> ---------+---------------+-----------
>  md5     | character(32) | not null
>  link_id | integer       | not null
> Indexes: ix_link_checksums_pk primary key btree (md5)

In that event I would expect that you might only save a few disk accesses by
having a btree with fewer levels.

If the query is slow, it might be doing a sequential scan because of a type
mismatch. You can use EXPLAIN to double check what plan is being used.
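For example, something along these lines would confirm the plan (the literal
checksum is just a placeholder; exact output varies by version, but an index
scan on ix_link_checksums_pk is what you want to see):

  EXPLAIN
  SELECT * FROM link_checksums
   WHERE md5 = '0123456789abcdef0123456789abcdef';

  -- Hoped-for plan (abbreviated):
  --   Index Scan using ix_link_checksums_pk on link_checksums ...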
Bruno Wolff III wrote:
> On Fri, Oct 10, 2003 at 11:27:50 -0400, Jeff Boes <jboes@nexcerpt.com> wrote:
>
> > Yes, the table has:
> >
> >      Table "public.link_checksums"
> >  Column  |     Type      | Modifiers
> > ---------+---------------+-----------
> >  md5     | character(32) | not null
> >  link_id | integer       | not null
> > Indexes: ix_link_checksums_pk primary key btree (md5)
>
> In that event I would expect that you might only save a few disk accesses
> by having a btree with fewer levels.
>
> If the query is slow, it might be doing a sequential scan because of
> a type mismatch. You can use EXPLAIN to double check what plan is being
> used.

Actually, the query is *not* slow; but since we're executing it a million
times a day, any savings we can realize will add up in a hurry. For example,
yesterday this query resulted in the following stats:

  'count' => 814621,
  'avg'   => '0.009',
  'time'  => '7674.932'

That is, we executed it 814,621 times, for a total (wallclock) time spent
waiting of 7,674 seconds (obviously, we have multiple backends executing). So,
even if we can cut the average by only 0.004 seconds, that would result in a
savings of almost an hour.

So, again: will front-loading the work by mapping the original query to 16 (or
256) different queries by examining the first digit save us anything? (My
colleague who came up with this idea thinks so, since the calculation will be
done on a box other than the database host, and even one disk access saved per
query would outweigh the calculation.) Will having 15 (or 255) additional
tables make the cache behave differently? Is there overhead associated with
having another 15 (or 255) tables?

--
Jeff Boes                vox 269.226.9550 ext 24
Database Engineer        fax 269.349.9076
Nexcerpt, Inc.           http://www.nexcerpt.com
...Nexcerpt... Extend your Expertise
On Fri, Oct 10, 2003 at 04:30:20PM -0000, greg@turnstep.com wrote:
> One way to speed things up is to convert the entire checksum. Consider
> what an md5 checksum really is: a text string representing a hexadecimal
> number. Storing it as TEXT or CHAR is not as good as storing it as a
> number directly. Have your application convert it to a decimal number,
> and then store the checksum as type NUMERIC in the database.

A further thought: with NUMERIC or other variable-length fields you are
wasting time, space and cache hits anyway. It would probably be faster to
create a custom datatype with a fixed length, sized exactly for an MD5 sum.
With suitable input and output functions and all the operators you need, you
will likely gain an additional performance boost.

IIRC, Manfred Koizar developed a fixed-width char datatype for Shridhar
Daithankar maybe a year ago. It is probably a good starting point. Look for it
in the pgsql-performance archives.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Endurecerse, pero jamás perder la ternura" (E. Guevara)
On Fri, Oct 10, 2003 at 13:40:17 -0400, Jeff Boes <jboes@nexcerpt.com> wrote:
>
> So, again: will front-loading the work by mapping the original query to
> 16 (or 256) different queries by examining the first digit save us
> anything? (My colleague who came up with this idea thinks so, since the
> calculation will be done on a box other than the database host, and even
> one disk access saved per query would outweigh the calculation.)

This could potentially save you some disk accesses. If the index is mostly in
cache now, you might not get a noticeable benefit.

> Will having 15 (or 255) additional tables make the cache behave
> differently? Is there overhead associated with having another 15 (or
> 255) tables?

I am not sure whether or not you would use significantly more space by having
multiple tables.

The data format change suggested by someone else may be worth trying as well.
In addition to their suggestions, you might experiment with keeping the hash
in either 4 ints or 2 bigints. If you use bigints, you could probably just use
an index on one of the bigints and have only a small chance of finding more
than one row that matches.
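A sketch of the two-BIGINT variant, for illustration (column names are made
up; note BIGINT is signed, so the application has to map each 16-hex-digit
half into the signed 64-bit range, e.g. by subtracting 2^64 from values at or
above 2^63, before binding):

  CREATE TABLE link_checksums (
      md5_hi  bigint  NOT NULL,  -- first 16 hex digits, as a signed 64-bit value
      md5_lo  bigint  NOT NULL,  -- last 16 hex digits, as a signed 64-bit value
      link_id integer NOT NULL
  );

  -- One index on the high half is enough; 64-bit collisions are rare enough
  -- that md5_lo can be checked without its own index.
  CREATE INDEX ix_link_checksums_hi ON link_checksums (md5_hi);

  SELECT 1 FROM link_checksums WHERE md5_hi = ? AND md5_lo = ? LIMIT 1;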
>>>>> "JB" == Jeff Boes <jboes@nexcerpt.com> writes: JB> Will a query against a table of 0.5 million rows beat a query against JB> a table of 7 million rows by a margin that makes it worth the hassle JB> of supporting 15 "extra" tables? I think you'll be better off with a single table, as you won't have contention for the index pages in the cache. One thing to do is to reindex reasonably often (for PG < 7.4) to avoid index bloat, which will make them not fit in cache. Just check the size of your index in the pg_class table, and when it gets big, reindex (assuming you do lots of updates/inserts to the table). Your table splitting solution sounds like something I'd do if I were forced to use mysql ;-) -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Vivek Khera, Ph.D. Khera Communications, Inc. Internet: khera@kciLink.com Rockville, MD +1-240-453-8497 AIM: vivekkhera Y!: vivek_khera http://www.khera.org/~vivek/
Bruno Wolff III <bruno@wolff.to> writes:

> The data format change suggested by someone else may be worth trying
> as well. In addition to their suggestions, you might experiment with
> keeping the hash in either 4 ints or 2 bigints. If you use bigints,
> you could probably just use an index on one of the bigints and have
> only a small chance of finding more than one row that matches.

This sounds good to me too. You would have to experiment to see if the 4x int
format is faster than the 2x bigint or vice versa. I suspect the 4x int format
is way faster; if you have few enough collisions (like single digit) it would
probably be the best.

Using native fixed-size datatypes that fit in a Datum and have assembly
instructions for comparison should be a big win over a variable-sized datatype
that has to be dereferenced from a pointer and then put through complex
functions to handle comparison.

--
greg
Bruno Wolff III wrote:
> The data format change suggested by someone else may be worth trying
> as well. In addition to their suggestions, you might experiment with
> keeping the hash in either 4 ints or 2 bigints. If you use bigints,
> you could probably just use an index on one of the bigints and have
> only a small chance of finding more than one row that matches.

This is an interesting idea. Alternatively, just use bytea and store the 16
bytes directly (i.e. no hex or base64 encoding). There is b-tree index support
for bytea.

Joe
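A minimal sketch of that approach (table and column names follow the earlier
\d output; decode(..., 'hex') turns the 32-character hex string into the raw
16 bytes on the server side):

  CREATE TABLE link_checksums (
      md5     bytea   NOT NULL PRIMARY KEY,  -- 16 raw bytes instead of 32 hex chars
      link_id integer NOT NULL
  );

  -- The application can keep passing the hex string and let the server convert:
  SELECT 1 FROM link_checksums
   WHERE md5 = decode('0123456789abcdef0123456789abcdef', 'hex');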
Jeff Boes wrote:
> Obviously, this is conceptually similar to what the index on the "md5"
> column is supposed to do for us. However, partitioning moves just a
> little of the processing load off the database server and onto the
> machine running the application. That's important, because we can afford
> more application machines as load increases, but we can't as easily
> upgrade the database server.
>
> Will a query against a table of 0.5 million rows beat a query against a
> table of 7 million rows by a margin that makes it worth the hassle of
> supporting 15 "extra" tables?

I don't think 16 tables on the same server will help, but if you already have
your app tier physically separate from the database tier, you could partition
your data across more than one database server based on the first byte of the
md5 column. I designed and built something similar a few years ago. We never
got to the point where we really needed that kind of scalability, but it
worked pretty well in (limited) testing.

Joe
BULL.

How many times does PG have to scan the whole table because of MVCC? At least
with partitioning there is a fighting chance that that won't be necessary.
Queries that involve the field on which the table is partitioned execute
faster by an order of magnitude. It also helps with vacuuming, as PG can
vacuum only one partition at a time. I have a 17M row table where all records
get frequently updated over a year. I would do my own partitioning with
inheritance if it was not broken. Partitioning would be a BIG plus in my book.
So would visibility of records, but that is another fight.

JLL

Vivek Khera wrote:
>
> >>>>> "JB" == Jeff Boes <jboes@nexcerpt.com> writes:
>
> JB> Will a query against a table of 0.5 million rows beat a query against
> JB> a table of 7 million rows by a margin that makes it worth the hassle
> JB> of supporting 15 "extra" tables?
>
> I think you'll be better off with a single table, as you won't have
> contention for the index pages in the cache.
>
> One thing to do is to reindex reasonably often (for PG < 7.4) to avoid
> index bloat, which will make them not fit in cache. Just check the
> size of your index in the pg_class table, and when it gets big,
> reindex (assuming you do lots of updates/inserts to the table).
>
> Your table splitting solution sounds like something I'd do if I were
> forced to use mysql ;-)
Jean-Luc Lachance wrote:
>
> BULL.
>
> How many times does PG have to scan the whole table because of MVCC?
> At least with partitioning there is a fighting chance that that won't be
> necessary.
> Queries that involve the field on which the table is partitioned execute
> faster by an order of magnitude.
> It also helps with vacuuming, as PG can vacuum only one partition at a
> time.
> I have a 17M row table where all records get frequently updated over a
> year.
> I would do my own partitioning with inheritance if it was not broken.
> Partitioning would be a BIG plus in my book. So would visibility of
> records, but that is another fight.

I meant to say visibility of records in the index.
Is this partitioning like the schemas mentioned here:
http://www.postgresql.org/docs/current/static/ddl-schemas.html? Would those
help and increase performance?

/B

----- Original Message -----
From: "Jean-Luc Lachance" <jllachan@nsd.ca>
To: "Vivek Khera" <khera@kcilink.com>
Cc: <pgsql-general@postgresql.org>
Sent: Friday, October 10, 2003 14:23
Subject: Re: [GENERAL] Table partitioning for maximum speed?

> BULL.
>
> How many times does PG have to scan the whole table because of MVCC?
> At least with partitioning there is a fighting chance that that won't be
> necessary.
> Queries that involve the field on which the table is partitioned execute
> faster by an order of magnitude.
> It also helps with vacuuming, as PG can vacuum only one partition at a
> time.
> I have a 17M row table where all records get frequently updated over a
> year.
> I would do my own partitioning with inheritance if it was not broken.
> Partitioning would be a BIG plus in my book. So would visibility of
> records, but that is another fight.
>
> JLL
>>>>> "JL" == Jean-Luc Lachance <jllachan@nsd.ca> writes: JL> BULL. JL> How many times does PG have to scan the whole table because of MVCC? JL> At least with partitioning there is a fighting chance that that won't be JL> necessary. Huh? His specific query was "WHERE md5 = '....'". Why on earth would that force a sequence scan if it were an indexed column? Heck, the btree index should rule out 15/16ths of the rows after the first character comparison.
No. There is a big difference between schemas and partitioned tables. The
closest thing would be a bunch of tables inherited from a base table, each
child table holding the rows that share one value of the partitioning field.
The problem is that the planner cannot be aware of this and make better use of
the implicit key.

JLL

David Busby wrote:
>
> Is this partitioning like the schemas mentioned here:
> http://www.postgresql.org/docs/current/static/ddl-schemas.html? Would those
> help and increase performance?
>
> /B
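For reference, a minimal sketch of what that inheritance-based layout might
look like (names are made up, reusing the tables from the original post; as
noted above, the planner of this era does not use the CHECK constraints to
skip child tables, so the application still has to target the right child):

  CREATE TABLE my_table (
      md5     character(32) NOT NULL,
      link_id integer       NOT NULL
  );

  CREATE TABLE my_table_0 (
      CHECK (md5 LIKE '0%')
  ) INHERITS (my_table);

  CREATE TABLE my_table_1 (
      CHECK (md5 LIKE '1%')
  ) INHERITS (my_table);

  -- ...and so on through my_table_f; each child needs its own index:
  CREATE INDEX ix_my_table_0_md5 ON my_table_0 (md5);
  CREATE INDEX ix_my_table_1_md5 ON my_table_1 (md5);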
I was replying to the general comment that PG cannot profit from having
partitioned tables.

Vivek Khera wrote:
>
> >>>>> "JL" == Jean-Luc Lachance <jllachan@nsd.ca> writes:
>
> JL> BULL.
> JL> How many times does PG have to scan the whole table because of MVCC?
> JL> At least with partitioning there is a fighting chance that that won't be
> JL> necessary.
>
> Huh? His specific query was "WHERE md5 = '....'". Why on earth would
> that force a sequence scan if it were an indexed column? Heck, the
> btree index should rule out 15/16ths of the rows after the first
> character comparison.