Help, 400 million rows, query slow, I stopped after 5 days

From: "Christian Hofmann"
Date:
Hello,

I have a little table with 400,000,000 rows of text (about 40 chars each).
I want to delete the values that exist more than once, keeping one copy of each.

So I created a new table (table2), the same as the first, but with a
unique index.
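For reference, table2 looks roughly like this (the column type is from
memory, so take it as a sketch):

CREATE TABLE table2 (
    my_col varchar(40)   -- text values, about 40 chars each
);

CREATE UNIQUE INDEX table2_my_col_idx ON table2 (my_col);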
Then I ran the following query:

INSERT INTO table2 (my_col) SELECT DISTINCT my_col from table1

But this query was so slow that I stopped it. I let it run for five days!

The server has only 1 GB of RAM. There is a bitmap index on my_col.

Maybe I should create a trigger on table2 that checks before each insert
whether the value is already in table2.
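Something like this, maybe (only a sketch, the function and trigger names
are made up):

CREATE FUNCTION table2_skip_duplicates() RETURNS trigger AS $$
BEGIN
    -- skip the row if the value already exists in table2
    IF EXISTS (SELECT 1 FROM table2 WHERE my_col = NEW.my_col) THEN
        RETURN NULL;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER skip_duplicates
    BEFORE INSERT ON table2
    FOR EACH ROW EXECUTE PROCEDURE table2_skip_duplicates();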


Or is it possible to just use the unique index in table2 and do something
like:

INSERT INTO table2 (my_col) SELECT my_col from table1

and just ignore the errors that are thrown because of the duplicate
values?
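For example with a loop like this (again only a sketch; as far as I know
every exception block starts a subtransaction, so this could be slow too):

CREATE FUNCTION copy_without_duplicates() RETURNS void AS $$
DECLARE
    r RECORD;
BEGIN
    FOR r IN SELECT my_col FROM table1 LOOP
        BEGIN
            INSERT INTO table2 (my_col) VALUES (r.my_col);
        EXCEPTION WHEN unique_violation THEN
            NULL;   -- duplicate value, just skip it
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT copy_without_duplicates();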

I hope you can help me,

Thank you,

Christian



Re: Help, 400 million rows, query slow, I stopped after 5 days

From: Tom Lane
Date:
"Christian Hofmann" <christian.hofmann@gmx.de> writes:
> INSERT INTO table2 (my_col) SELECT DISTINCT my_col from table1
> But this query was so slow that I stopped it. I let it run for five days!

What PG version is this, and what do you have sort_mem set to?
Increasing sort_mem to a few hundred meg should help, assuming the
server can spare that much RAM.
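Something along these lines, in the same session as the query (the value is
in kilobytes; if you're on 8.0 or later the parameter is spelled work_mem
instead):

SET sort_mem = 524288;   -- roughly 512MB, just for this session
INSERT INTO table2 (my_col) SELECT DISTINCT my_col FROM table1;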

> The server has only 1 GB of RAM. There is a bitmap index on my_col.

There are no bitmap indexes in Postgres...
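You can see what the index really is with something like

SELECT indexdef FROM pg_indexes WHERE tablename = 'table1';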

            regards, tom lane

Re: Help, 400 million rows, query slow, I stopped after 5 days

From: "Christian Hofmann"
Date:
Hello Tom,

Thank you for your fast answer.

> What PG version is this, and what do you have sort_mem set to?

I am running 8.1.1 on Windows Server 2003 Web Edition.

Regarding "Memory", these two values are the only ones that are set:

shared_buffers = 1000                   # min 16 or max_connections*2, 8KB each
#temp_buffers = 40000                   # min 100, 8KB each
#max_prepared_transactions = 5          # can be 0 or more
# note: increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
work_mem = 650240                       # min 64, size in KB
...

> Increasing sort_mem to a few hundred meg should help, assuming the
> server can spare that much RAM.

Maybe decreasing work_mem to 50000 and setting sort_mem to 600000 would
help. But where can I find sort_mem? There is nothing with that keyword in
postgresql.conf :-(

> There are no bitmap indexes in Postgres...

The index was created "USING btree". I thought that was a kind of bitmap
index, but I guess it is a B-tree then :-)

Thank you,

Christian



Re: Help, 400 million rows, query slow, I stopped after 5 days

From: Tom Lane
Date:
"Christian Hofmann" <christian.hofmann@gmx.de> writes:
> work_mem = 650240            # min 64, size in KB

> But where can I find sort_mem? There is nothing with that keyword in
> postgresql.conf :-(

work_mem and sort_mem are the same ... so looks like you already pushed
it up.  Actually, maybe you pushed it up too far --- are you sure that
the machine has 650M RAM available to spare?  On Unix I'd suggest
watching free memory and swap activity to tell if it's too high, but
I dunno how to tell on Windows.

            regards, tom lane

Re: Help, 400 million rows, query slow, I stopped after 5 days

From: "Christian Hofmann"
Date:
Hello Tom,

> work_mem and sort_mem are the same ... so looks like you
> already pushed
> it up.  Actually, maybe you pushed it up too far --- are you sure that
> the machine has 650M RAM available to spare?  On Unix I'd suggest
> watching free memory and swap activity to tell if it's too high, but
> I dunno how to tell on Windows.

Okay, maybe I set it too high. I will try to find that out.

Maybe I should run 26 queries like this:

INSERT INTO table2 (my_col) SELECT DISTINCT my_col from table1 where my_col
like 'a%'

That way PostgreSQL could use my index on my_col and keep more of the data
in memory. Do you think that would be better?

Thank you,

Christian