Thread: Partitioning a tsearch2 table into chunks and accessing via views
Hi,

I have an application which loads millions of new documents each month into a PostgreSQL tsearch2 table. I have the initial version completed, and searching performance is great, but my problem is that each time a new month rolls around I have to drop all the indexes, do a COPY, and re-index the entire table. This is problematic considering that each month takes longer than the previous to rebuild the indexes, and the application is unavailable during the rebuilding process.

In order to avoid the re-indexing I was thinking of instead creating a new table each month (building its indexes and so on) and accessing them all through a view. This way I only have to index the new data each month.

Does this work? Does a view over N tables make it N times slower for tsearch2 queries? Is there a better solution?

Benjamin
Benjamin,

> In order to avoid the re-indexing I was thinking of instead creating a new
> table each month (building its indexes and etc) and accessing them all
> through a view. This way I only have to index the new data each month.

Take a look at bizgres.org (based on postgres). They have a parent-child structure.

The way I use it: I have about 30M records a day that are inserted into the database. Each day is a "child" table to the "parent".

So, for example: the parent table is called "logfile", and each day is a child with a name like "logfile_YYMMDD". The "child" inherits the table structure of the parent, so you can either query the child table directly, or run the query against the parent (i.e. the logfile table) and get all the data.

The indexes are done on a per-table basis, so the new data that comes in is a lesser amount and doesn't require re-indexing. For example:

    select * from logfile_070825 where datafield = 'foo'

if I know I want to go into that specific child, or:

    select * from logfile where datafield = 'foo'

and all child tables are searched and the results merged. You can perform any kind of SQL query, with the same field structures, as you normally do.

The downside is that the queries are run sequentially: if you had 100 child tables, each table is queried via its indexes, then the results are merged. But this approach does allow me to dump a lot of data in without the re-indexing issues you are facing. At some point you could roll up the days into weekly child tables, then monthly tables, etc. I believe Bizgres has a new version of their system that does parallel queries, which would certainly speed things up.

For your documents, you could partition by the day each one was checked in, or maybe you have some other way of logically grouping them, but the parent/child table structure really helped to solve my problem of adding in millions of records each day.
The closest thing in mysql is using merge tables, which is not really practical when it comes time to do the joins to the tables. -brandon http://www.t3report.com - marketing intelligence for online marketing and affiliate programs
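[Editor's note: the layout Brandon describes can be sketched roughly as below. Table and column names are made up for illustration; only "logfile", "logfile_070825", and "datafield" come from his message.]

```sql
-- Parent table: holds the common structure (and, in this scheme, no data).
CREATE TABLE logfile (
    logged    date NOT NULL,   -- hypothetical partitioning column
    datafield text
);

-- One child per day, named logfile_YYMMDD; it inherits all parent columns.
CREATE TABLE logfile_070825 () INHERITS (logfile);

-- Indexes are per child, so only the new day's data ever gets indexed.
CREATE INDEX logfile_070825_datafield_idx ON logfile_070825 (datafield);

-- Hit one day directly:
SELECT * FROM logfile_070825 WHERE datafield = 'foo';

-- Or query the parent: every child is searched and the results merged.
SELECT * FROM logfile WHERE datafield = 'foo';
```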
Re: [GENERAL] Partitioning a tsearch2 table into chunks and accessing via views
From: "Joshua D. Drake"
Brandon Shalton wrote:
> the indexes are done on a per-table basis, so the new data that comes in
> is a lesser amount and doesn't require re-indexing.

PostgreSQL can do all of this too.

Sincerely,

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564  24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997  http://www.commandprompt.com/
UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
Re: [GENERAL] Partitioning a tsearch2 table into chunks and accessing via views
From: Benjamin Arai
This is kind of disappointing; I was hoping there was more that could be done. There has to be another way to do incremental indexing without losing that much performance.

Benjamin

On Aug 24, 2007, at 6:29 PM, Joshua D. Drake wrote:
> PostgreSQL can do all of this too.
Re: [GENERAL] Partitioning a tsearch2 table into chunks and accessing via views
From: "Joshua D. Drake"
Benjamin Arai wrote:
> This is kind of disappointing; I was hoping there was more that could be
> done. There has to be another way to do incremental indexing without
> losing that much performance.

What makes you think you are losing performance by using partitioning?

Joshua D. Drake
Re: [GENERAL] Partitioning a tsearch2 table into chunks and accessing via views
From: Benjamin Arai
As stated in the previous email, if I use partitioning then queries will be executed sequentially - i.e., instead of log(n) it would be (# partitions) * log(n). Right?

Benjamin

On Aug 25, 2007, at 9:18 AM, Joshua D. Drake wrote:
> What makes you think you are losing performance by using partitioning?
Re: [GENERAL] Partitioning a tsearch2 table into chunks and accessing via views
From: "Brandon Shalton"
> As stated in the previous email, if I use partitioning then queries will
> be executed sequentially - i.e., instead of log(n) it would be (#
> partitions) * log(n). Right?

It depends. Since the indexes are hit for each child table, the time for the query depends on the amount of data that is indexed in each table. Querying the parent is still pretty quick given a dual processor and a fast array filestorage device.

Given your situation, I would give the parent/child approach a try. I haven't checked whether stock postgres has it, as Joshua replied it does, but I do know bizgres does, as I have been running this configuration for the last 3 years and it solved my problem of importing 30-60M records in a day and still being able to query the database for data.

-brandon

http://www.t3report.com - marketing intelligence for online marketing and affiliate programs
Re: [GENERAL] Partitioning a tsearch2 table into chunks and accessing via views
From: "Joshua D. Drake"
Benjamin Arai wrote:
> As stated in the previous email, if I use partitioning then queries will
> be executed sequentially - i.e., instead of log(n) it would be (#
> partitions) * log(n). Right?

The planner will consider every relevant partition during execution. That may or may not be a performance hit; it depends on many factors. In general, however, partitioning done correctly is both a performance benefit and a maintenance benefit.

Sincerely,

Joshua D. Drake
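[Editor's note: the "relevant partition" point above is constraint exclusion, available since PostgreSQL 8.1. A minimal sketch of how it shows up, assuming a hypothetical "docs" parent whose monthly children carry CHECK constraints on a date column named "added":]

```sql
-- Constraint exclusion lets the planner skip any child whose CHECK
-- constraint proves it cannot match the query's predicate.
SET constraint_exclusion = on;

-- With a predicate on the partitioning column, only the child whose
-- CHECK range covers these dates appears in the plan; the rest are
-- never scanned.  (EXPLAIN shows which children survive exclusion.)
EXPLAIN SELECT * FROM docs
WHERE added >= '2007-09-15' AND added < '2007-09-16';
```

Without such a predicate, every child is planned and queried, which is the sequential behavior Brandon described.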
Re: [GENERAL] Partitioning a tsearch2 table into chunks and accessing via views
From: Benjamin Arai
Since I am using tsearch2 on the table, I think there is going to be a significant performance hit: I partition by batch (batches are not separated by date; they are essentially random subsets of a much larger data-set). I am querying this database using tsearch2, so I am assuming all tables are going to be queried each time, since the text is not partitioned by any specific constraint - e.g., >R goes to table 1 and <=R goes to table 2.

Benjamin

On Aug 25, 2007, at 11:18 AM, Joshua D. Drake wrote:
> The planner will consider every relevant partition during execution.
> That may or may not be a performance hit; it depends on many factors.
On Aug 24, 2007, at 7:41 PM, Benjamin Arai wrote: > Hi, > > I have an application which loads millions of NEW documents each month > into a PostgreSQL tsearch2 table. I have the initial version > completed > and searching performance is great but my problem is that each time > a new > month rolls around I have to drop all the indexes do a COPY and re- > index > the entire table. This is problematic considering that each month > takes > longer than the previous to rebuild the indexes and the application in > unavailable during the rebuilding process. > > In order to avoid the re-indexing I was thinking of instead > creating a new > table each month (building its indexes and etc) and accessing them all > through a view. This way I only have to index the new data each month. > > Does this work? Does a view with N tables make it N times slower for > tsearch2 queries? Is there a better solution? You can use Postgres's inheritance mechanism for your partitioning mechanism and combine it with constraint exclusion to avoid the N^2 issues. See: http://www.postgresql.org/docs/8.2/interactive/ddl-inherit.html and http://www.postgresql.org/docs/8.2/interactive/ddl-partitioning.html Basically, create a table from which all of your partitioned tables inherit. Partition in such a way that you can use constraint exclusion and then you can treat the parent table like the view you were suggesting. Erik Jones Software Developer | Emma® erik@myemma.com 800.595.4401 or 615.292.5888 615.292.0777 (fax) Emma helps organizations everywhere communicate & market in style. Visit us online at http://www.myemma.com
On Aug 25, 2007, at 2:58 PM, Erik Jones wrote: > > On Aug 24, 2007, at 7:41 PM, Benjamin Arai wrote: > >> Hi, >> >> I have an application which loads millions of NEW documents each >> month >> into a PostgreSQL tsearch2 table. I have the initial version >> completed >> and searching performance is great but my problem is that each >> time a new >> month rolls around I have to drop all the indexes do a COPY and re- >> index >> the entire table. This is problematic considering that each month >> takes >> longer than the previous to rebuild the indexes and the >> application in >> unavailable during the rebuilding process. >> >> In order to avoid the re-indexing I was thinking of instead >> creating a new >> table each month (building its indexes and etc) and accessing them >> all >> through a view. This way I only have to index the new data each >> month. >> >> Does this work? Does a view with N tables make it N times slower for >> tsearch2 queries? Is there a better solution? > > > You can use Postgres's inheritance mechanism for your partitioning > mechanism and combine it with constraint exclusion to avoid the N^2 > issues. See: > > http://www.postgresql.org/docs/8.2/interactive/ddl-inherit.html > > and > > http://www.postgresql.org/docs/8.2/interactive/ddl-partitioning.html > > Basically, create a table from which all of your partitioned tables > inherit. Partition in such a way that you can use constraint > exclusion and then you can treat the parent table like the view you > were suggesting. > > Erik Jones > Sorry, I didn't see that you had crossposted and carried the conversation on another list. Please, don't do that. Avoid the top posting, as well. They both make it difficult for others to join in or follow the conversations and issues. Erik Jones Software Developer | Emma® erik@myemma.com 800.595.4401 or 615.292.5888 615.292.0777 (fax) Emma helps organizations everywhere communicate & market in style. Visit us online at http://www.myemma.com