Thread: Partitioning a tsearch2 table into chunks and accessing via views

Partitioning a tsearch2 table into chunks and accessing via views

From: "Benjamin Arai"
Date:
Hi,

I have an application which loads millions of NEW documents each month
into a PostgreSQL tsearch2 table.  I have the initial version completed
and search performance is great, but my problem is that each time a new
month rolls around I have to drop all the indexes, do a COPY, and
re-index the entire table. This is problematic considering that each
month takes longer than the previous to rebuild the indexes, and the
application is unavailable during the rebuilding process.

In order to avoid the re-indexing I was thinking of instead creating a
new table each month (building its indexes, etc.) and accessing them all
through a view. This way I only have to index the new data each month.

Does this work?  Does a view with N tables make it N times slower for
tsearch2 queries? Is there a better solution?
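For concreteness, the view approach being asked about would look roughly like this (a sketch only; table and column names are hypothetical, tsearch2-era syntax assumed):

```sql
-- One table per month, each with its own full-text index.
CREATE TABLE docs_200707 (id integer, body text, body_tsv tsvector);
CREATE TABLE docs_200708 (id integer, body text, body_tsv tsvector);

CREATE INDEX docs_200707_tsv_idx ON docs_200707 USING gist (body_tsv);
CREATE INDEX docs_200708_tsv_idx ON docs_200708 USING gist (body_tsv);

-- A view spanning all months; adding a month means creating one new
-- table plus its index, then recreating the view.
CREATE OR REPLACE VIEW docs AS
    SELECT * FROM docs_200707
    UNION ALL
    SELECT * FROM docs_200708;

-- A tsearch2-style query against the view hits every month's index.
SELECT id FROM docs WHERE body_tsv @@ to_tsquery('foo');
```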

Benjamin

Re: [PERFORM] Partitioning a tsearch2 table into chunks and accessing via views

From: "Brandon Shalton"
Date:
Benjamin,


>
> In order to avoid the re-indexing I was thinking of instead creating  a
> new
> table each month (building its indexes and etc) and accessing them all
> through a view. This way I only have to index the new data each month.
>

Take a look at bizgres.org (based on postgres).

They have a parent-child structure.

The way I use it: I have about 30M records a day that are inserted into
the database.

Each day is a "child" table to the "parent".

So, for example:

the parent table is called "logfile"

each day is a child, with a name like "logfile_YYMMDD"

the "child" inherits the table structure of the parent, such that you could
query the child table by name directly, or you run the query against the
parent (i.e. the logfile table) and get all the data.

the indexes are done on a per-table basis, so the new data that comes in is
a smaller amount, and doesn't require re-indexing everything else.
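The parent/child setup described above looks roughly like this in PostgreSQL's inheritance syntax (a sketch; column names are hypothetical):

```sql
-- Parent table: defines the common structure, holds no data itself.
CREATE TABLE logfile (
    logtime   timestamp,
    datafield text
);

-- One child per day; each inherits the parent's columns.
CREATE TABLE logfile_070825 () INHERITS (logfile);
CREATE TABLE logfile_070826 () INHERITS (logfile);

-- Indexes are created per child, so only the newest (small) day's
-- index ever has to be built.
CREATE INDEX logfile_070825_idx ON logfile_070825 (datafield);
CREATE INDEX logfile_070826_idx ON logfile_070826 (datafield);

-- Querying the parent scans the parent plus all children.
SELECT * FROM logfile WHERE datafield = 'foo';
```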


example:

select * from logfile_070825 where datafield = 'foo'

if I knew I wanted to go specifically into that child, or:

select * from logfile where datafield = 'foo'

and all child tables are searched and the results merged.  You can perform
any kind of SQL query against the same field structure as you normally would.

the downside is that the queries are run sequentially.

so if you had 100 child tables, each table is queried via its indexes, then
the results are merged.

but this approach does allow me to dump a lot of data in without the
re-indexing issues you are facing.

at some point, you could roll up the days into weekly child tables, then
monthly tables, etc.

I believe Bizgres has a new version of their system that does parallel
queries which would certainly speed things up.

For your documents, you can do it by the day it was checked in, or maybe you
have some other way of logically grouping, but the parent/child table
structure really helped to solve my problem of adding in millions of records
each day.

The closest thing in MySQL is merge tables, which are not really
practical when it comes time to do the joins to the tables.

-brandon

http://www.t3report.com - marketing intelligence for online marketing and
affiliate programs
Re: [PERFORM] Partitioning a tsearch2 table into chunks and accessing via views

From: "Joshua D. Drake"
Date:

Brandon Shalton wrote:
> Take a look at bizgres.org (based on postgres).
>
> They have a parent-child structure.
>
> the indexes are done on a per table basis, so new data that comes in, is
> a lesser amount, and doesn't require re-indexing.


PostgreSQL can do all of this too.

Sincerely,

Joshua D. Drake



--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997  http://www.commandprompt.com/
            UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


Re: [PERFORM] Partitioning a tsearch2 table into chunks and accessing via views

From: Benjamin Arai
Date:

This is kind of disappointing; I was hoping there was more that could
be done.

There has to be another way to do incremental indexing without
losing that much performance.

Benjamin

On Aug 24, 2007, at 6:29 PM, Joshua D. Drake wrote:

> Brandon Shalton wrote:
>> the indexes are done on a per table basis, so new data that comes
>> in, is a lesser amount, and doesn't require re-indexing.
>
> PostgreSQL can do all of this too.
>
> Sincerely,
>
> Joshua D. Drake


Re: [PERFORM] Partitioning a tsearch2 table into chunks and accessing via views

From: "Joshua D. Drake"
Date:

Benjamin Arai wrote:
> This is kind of disappointing; I was hoping there was more that could be done.
>
> There has to be another way to do incremental indexing without losing
> that much performance.

What makes you think you are losing performance by using partitioning?

Joshua D. Drake


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend




Re: [PERFORM] Partitioning a tsearch2 table into chunks and accessing via views

From: Benjamin Arai
Date:

As stated in the previous email, if I use partitioning then queries
will be executed sequentially - i.e., instead of log(n) it would be
(# partitions) * log(n).  Right?

Benjamin

On Aug 25, 2007, at 9:18 AM, Joshua D. Drake wrote:

> What makes you think you are losing performance by using
> partitioning?
>
> Joshua D. Drake


Re: [PERFORM] Partitioning a tsearch2 table into chunks and accessing via views

From: "Brandon Shalton"
Date:
>
> As stated in the previous email if I use partitioning then queries  will
> be executed sequentially - i.e., instead of log(n) it would be  (#
> partitions) * log(n).  Right?
>

It depends. Since the indexes are hit for each child table, query time
depends on the amount of data that is indexed in each table.

Querying the parent is still pretty quick, given a dual processor and a
fast disk array for storage.

Given your situation, I would give the parent/child approach a try. I
haven't checked whether stock PostgreSQL has it, though Joshua replied
that it does, but I do know Bizgres does, as I have been running this
configuration for the last 3 years and it solved my problem of importing
30-60M records in a day while still being able to query the database for data.

-brandon



Re: [PERFORM] Partitioning a tsearch2 table into chunks and accessing via views

From: "Joshua D. Drake"
Date:

Benjamin Arai wrote:
> As stated in the previous email if I use partitioning then queries will
> be executed sequentially - i.e., instead of log(n) it would be (#
> partitions) * log(n).  Right?

The planner will consider every relevant partition during execution.
That may or may not be a performance hit; it depends on many factors.
In general, however, partitioning done correctly is both a performance
benefit and a maintenance benefit.
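"Done correctly" here usually means giving each child a CHECK constraint describing its slice of the data, so the planner can skip irrelevant partitions via constraint exclusion. A sketch (assumes PostgreSQL 8.1+ inheritance partitioning; table and column names hypothetical):

```sql
-- Each child carries a CHECK constraint describing its slice.
CREATE TABLE logfile_070825 (
    CHECK (logtime >= '2007-08-25' AND logtime < '2007-08-26')
) INHERITS (logfile);

-- With constraint exclusion on, a query whose WHERE clause
-- contradicts a child's CHECK constraint never touches that child.
SET constraint_exclusion = on;

SELECT * FROM logfile
WHERE logtime >= '2007-08-25' AND logtime < '2007-08-26'
  AND datafield = 'foo';
```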

Sincerely,

Joshua D. Drake



Re: [PERFORM] Partitioning a tsearch2 table into chunks and accessing via views

From: Benjamin Arai
Date:

Since I am using tsearch2 on the table, I think there is going to be a
significant performance hit. E.g., I partition by batch (batches are
not separated by date; they are essentially random subsets of a much
larger data set).  I am querying this database using tsearch2, so I
am assuming all tables are going to be queried each time, since the
text is not partitioned by any specific constraint - e.g., >R goes to
table 1 and <=R goes to table 2.
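Even so, the indexing cost and the query cost separate: each batch table would get its own full-text index, so only the new batch is ever indexed, even though every batch's index must be visited at query time. A sketch of what that would look like (tsearch2-era syntax; table and column names hypothetical):

```sql
-- Each new batch becomes one child table with its own tsearch2 index;
-- only this new, smaller index has to be built at load time.
CREATE TABLE docs_batch42 () INHERITS (docs);

CREATE INDEX docs_batch42_tsv_idx
    ON docs_batch42 USING gist (body_tsv);

-- With no partitioning constraint the planner cannot exclude any
-- batch, so this searches every child's index and merges the results.
SELECT id FROM docs WHERE body_tsv @@ to_tsquery('foo');
```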

Benjamin

On Aug 25, 2007, at 11:18 AM, Joshua D. Drake wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Benjamin Arai wrote:
>> As stated in the previous email if I use partitioning then queries
>> will
>> be executed sequentially - i.e., instead of log(n) it would be (#
>> partitions) * log(n).  Right?
>
> The planner will consider every relevant partition during the
> execution.
> Which may be a performance hit, it may not be. It depends on many
> factors. In general however, partitioning when done correctly is a
> performance benefit and a maintenance benefit.
>
> Sincerely,
>
> Joshua D. Drake
>
>
>>
>> Benjamin
>>
>> On Aug 25, 2007, at 9:18 AM, Joshua D. Drake wrote:
>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>
>>> Benjamin Arai wrote:
>>>> This kind of disappointing, I was hoping there was more that
>>>> could be
>>>> done.
>>>>
>>>> There has to be another way to do incremental indexing without
>>>> loosing
>>>> that much performance.
>>
>>> What makes you think you are loosing performance by using
>>> partitioning?
>>
>>> Joshua D. Drake
>>
>>>>
>>>> Benjamin
>>>>
>>>> On Aug 24, 2007, at 6:29 PM, Joshua D. Drake wrote:
>>>>
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA1
>>>>
>>>>> Brandon Shalton wrote:
>>>>>> Benjamin,
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> In order to avoid the re-indexing I was thinking of instead
>>>>>>> creating
>>>>>>> a new
>>>>>>> table each month (building its indexes and etc) and accessing
>>>>>>> them
>>>>>>> all
>>>>>>> through a view. This way I only have to index the new data each
>>>>>>> month.
>>>>>>>
>>>>>>
>>>>>> Take a look at bizgres.org (based on postgres).
>>>>>>
>>>>>> They have a parent-child structure.
>>>>>>
>>>>>> The way i use it, is I have about 30M records a day that are
>>>>>> inserted
>>>>>> into the database.
>>>>>>
>>>>>> Each day is a "child" table to the "parent".
>>>>>>
>>>>>> so example:
>>>>>>
>>>>>> the parent table is called  "logfile"
>>>>>>
>>>>>> each day, is a child, with the structure like  "logfile_YYMMDD"
>>>>>>
>>>>>> the "child" inherits the table structure of the parent, such
>>>>>> that you
>>>>>> could query the child table name directly, or you run the query
>>>>>> against
>>>>>> the parent (ie. logfile table) and get all the data.
>>>>>>
>>>>>> the indexes are done on a per table basis, so new data that comes
>>>>>> in, is
>>>>>> a lesser amount, and doesn't require re-indexing.
>>>>
>>>>
>>>>> PostgreSQL can do all of this too.
>>>>
>>>>> Sincerely,
>>>>
>>>>> Joshua D. Drake
>>>>
>>>>>>
>>>>>>
>>>>>> example:
>>>>>>
>>>>>> select * from logfile_070825 where datafield = 'foo'
>>>>>>
>>>>>> if i knew i wanted to specifically go into that child, or:
>>>>>>
>>>>>> select * from logfile where datafield = 'foo'
>>>>>>
>>>>>> and all child tables are searched and results merged.  You can
>>>>>> perform
>>>>>> any kind of sql query and field structures are you normally do.
>>>>>>
>>>>>> the downside is that the queries are run sequentially.
>>>>>>
>>>>>> so if you had 100 child tables, each table is queried via
>>>>>> indexes,
>>>>>> then
>>>>>> results are merged.
>>>>>>
>>>>>> but, this approach does allow me to dump alot of data in, without
>>>>>> having
>>>>>> the re-indexing issues you are facing.
>>>>>>
>>>>>> at some point, you could roll up the days, in to weekly child
>>>>>> tables,
>>>>>> then monthly tables, etc.
>>>>>>
>>>>>> I believe Bizgres has a new version of their system that does
>>>>>> parallel
>>>>>> queries which would certainly speed things up.
>>>>>>
>>>>>> For your documents, you could partition by the day a document
>>>>>> was checked in, or maybe you have some other way of logically
>>>>>> grouping them, but the parent/child table structure really
>>>>>> helped to solve my problem of adding in millions of records
>>>>>> each day.
>>>>>>
>>>>>> The closest thing in MySQL is using merge tables, which is not
>>>>>> really practical when it comes time to do joins to the tables.
>>>>>>
>>>>>> -brandon
>>>>>>
>>>>>> http://www.t3report.com - marketing intelligence for online
>>>>>> marketing and affiliate programs
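
A minimal sketch of the parent/child layout Brandon describes; the
table and column names here (`logged_at`, `datafield`, the COPY path)
are illustrative, not from his actual schema:

```sql
-- Parent table: query this to search across all days.
CREATE TABLE logfile (
    logged_at  timestamp NOT NULL,
    datafield  text
);

-- One child per day inherits the parent's structure.
CREATE TABLE logfile_070825 () INHERITS (logfile);

-- Indexes are built per child, so only the new (small)
-- day's table ever needs indexing.
CREATE INDEX logfile_070825_datafield_idx
    ON logfile_070825 (datafield);

-- Load the day's data into the child only; the parent
-- stays empty and untouched.
COPY logfile_070825 FROM '/data/logfile-070825.csv' WITH CSV;
```

A `SELECT ... FROM logfile` then scans the parent plus every child,
while `SELECT ... FROM logfile_070825` hits a single day directly.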
>>>>>>
>>>>>> ---------------------------(end of
>>>>>> broadcast)---------------------------
>>>>>> TIP 6: explain analyze is your friend
>>>>>>
>>>>
>>>>
>>>>> --
>>>>
>>>>>       === The PostgreSQL Company: Command Prompt, Inc. ===
>>>>> Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
>>>>> PostgreSQL solutions since 1997  http://www.commandprompt.com/
>>>>>             UNIQUE NOT NULL
>>>>> Donate to the PostgreSQL Project:
>>>>> http://www.postgresql.org/about/donate
>>>>> PostgreSQL Replication: http://www.commandprompt.com/products/
>>>>

On Aug 24, 2007, at 7:41 PM, Benjamin Arai wrote:

> Hi,
>
> I have an application which loads millions of NEW documents each month
> into a PostgreSQL tsearch2 table.  I have the initial version
> completed and searching performance is great, but my problem is that
> each time a new month rolls around I have to drop all the indexes, do
> a COPY, and re-index the entire table.  This is problematic
> considering that each month takes longer than the previous to rebuild
> the indexes, and the application is unavailable during the rebuilding
> process.
>
> In order to avoid the re-indexing I was thinking of instead creating
> a new table each month (building its indexes, etc.) and accessing
> them all through a view.  This way I only have to index the new data
> each month.
>
> Does this work?  Does a view with N tables make it N times slower for
> tsearch2 queries?  Is there a better solution?


You can use Postgres's inheritance mechanism for your partitioning and
combine it with constraint exclusion to avoid the N-times slowdown you
were worried about.  See:

http://www.postgresql.org/docs/8.2/interactive/ddl-inherit.html

and

http://www.postgresql.org/docs/8.2/interactive/ddl-partitioning.html

Basically, create a table from which all of your partitioned tables
inherit.  Partition in such a way that you can use constraint
exclusion and then you can treat the parent table like the view you
were suggesting.
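
A hedged sketch of what that looks like in 8.2-era syntax; the table,
column, and partition names below are made up for illustration, and the
`body_tsv` column assumes the tsearch2 contrib module is installed and
its trigger keeps the column current:

```sql
-- Parent table; holds no rows itself.
CREATE TABLE documents (
    added_on  date NOT NULL,
    body      text,
    body_tsv  tsvector      -- maintained by a tsearch2 trigger
);

-- Monthly partition whose CHECK constraint lets the planner
-- skip it when the date predicate can't match.
CREATE TABLE documents_2007_08 (
    CHECK (added_on >= DATE '2007-08-01'
       AND added_on <  DATE '2007-09-01')
) INHERITS (documents);

-- Full-text index built once per partition (GiST here;
-- 8.2 also offers GIN).
CREATE INDEX documents_2007_08_tsv_idx
    ON documents_2007_08 USING gist (body_tsv);

-- With constraint exclusion on, querying the parent scans
-- only the partitions whose CHECK constraints can match.
SET constraint_exclusion = on;
SELECT *
  FROM documents
 WHERE added_on >= DATE '2007-08-01'
   AND body_tsv @@ to_tsquery('default', 'foo');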

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com



On Aug 25, 2007, at 2:58 PM, Erik Jones wrote:


Sorry, I didn't see that you had crossposted and carried the
conversation on to another list.  Please don't do that, and avoid the
top posting as well; both make it difficult for others to join in or
follow the conversation.

Erik Jones




On Aug 25, 2007, at 1:34 AM, Benjamin Arai wrote:

> There has to be another way to do incremental indexing without
> losing that much performance.

This is the killer feature that prevents us from using the tsearch2
full-text indexer on Postgres.  We're investigating making a foreign
table from a SOLR full-text index, so our app only talks to Pg but the
text search is held in a good index.