Thread: tsearch2 and pdf files
I'm using PostgreSQL 8.1.5.
Tsearch2 is installed and runs well.
I'd like to use tsearch2 to index PDF files.
Does anyone have a detailed process to implement that?
You just need software that extracts the text from the PDF. Search Google for pdf2txt and similar tools. Printer drivers that try to extract text from anything are available as well.

On 11.12.2006 11:41, Philip Johnson wrote:
> I'd like to use tsearch2 to index PDF files.
> Does anyone have a detailed process to implement that?

--
Regards,
Hannes Dorbath
Do you know what kind of table I should use? Is there a shell script or a PHP script that does the work?

Regards

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On behalf of Hannes Dorbath
> Sent: Monday, 11 December 2006 12:21
> To: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] tsearch2 and pdf files
>
> You just need software that extracts the text from it. Search google for
> pdf2txt and others.
1. Convert the PDF to a text file with, e.g., xpdf.
2. Insert the parsed text into a table of your choice.
3. Make vectors from the text.

Cheers,

On 11 Dec 2006 at 18:23, Philip Johnson wrote:
> Do you know what kind of table I should use?
> Is there a shell script or a PHP script that does the work?
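The three steps above can be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the `pdf_docs` table, its column names, and the helper functions are made up for illustration; `cursor` is assumed to be a DB-API cursor that uses `%s` placeholders (psycopg2, for example); and the `pdftotext` tool that ships with xpdf is assumed to be on the PATH.

```python
import subprocess

# Hypothetical table layout; the thread leaves the choice of table open.
CREATE_TABLE_SQL = """
CREATE TABLE pdf_docs (
    id       serial PRIMARY KEY,
    filename text NOT NULL,
    body     text NOT NULL,
    body_vec tsvector
);
CREATE INDEX pdf_docs_vec_idx ON pdf_docs USING gist (body_vec);
"""

# Steps 2 and 3 in one statement: store the text and build the vector from it.
INSERT_SQL = (
    "INSERT INTO pdf_docs (filename, body, body_vec) "
    "VALUES (%s, %s, to_tsvector(%s))"
)


def pdftotext_command(pdf_path):
    """Command line for pdftotext; a text-file argument of '-' means stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path):
    """Step 1: run pdftotext and return the extracted text."""
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def index_pdf(cursor, pdf_path, extract=extract_text):
    """Extract the text of one PDF and insert it together with its vector."""
    text = extract(pdf_path)
    cursor.execute(INSERT_SQL, (pdf_path, text, text))
```

Calling `index_pdf(cursor, "manual.pdf")` for each file, followed by a commit on the connection, would fill the table. The `extract` parameter exists only so that a different extraction tool can be swapped in.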
> 1. Convert PDF to file with e.g. xpdf
> 2. Insert parsed text to a table of your choice.
> 3. Make vectors from the text.

Actually, if you're not going to use the headline() function, you can store only the vector, cutting down on the size requirements. Just insert the to_tsvector() result. The full text is required for headline() though, so you can't cheat on that.

//Magnus
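To make the trade-off concrete, here is a small Python sketch of both variants. The table names `pdf_index` and `pdf_docs`, the column names, and the `build_insert` helper are hypothetical, and the `%s` placeholders assume a DB-API driver such as psycopg2.

```python
# Variant A: store only the vector. Searchable, but headline() has no text to work on.
INSERT_VECTOR_ONLY = (
    "INSERT INTO pdf_index (filename, body_vec) VALUES (%s, to_tsvector(%s))"
)
SEARCH_VECTOR_ONLY = (
    "SELECT filename FROM pdf_index WHERE body_vec @@ to_tsquery(%s)"
)

# Variant B: store the text as well, so headline() can show matching excerpts.
INSERT_TEXT_AND_VECTOR = (
    "INSERT INTO pdf_docs (filename, body, body_vec) "
    "VALUES (%s, %s, to_tsvector(%s))"
)
SEARCH_WITH_HEADLINE = (
    "SELECT filename, headline(body, to_tsquery(%s)) "
    "FROM pdf_docs WHERE body_vec @@ to_tsquery(%s)"
)


def build_insert(filename, text, keep_text):
    """Return (sql, params) for the chosen storage variant."""
    if keep_text:
        return INSERT_TEXT_AND_VECTOR, (filename, text, text)
    return INSERT_VECTOR_ONLY, (filename, text)
```

With variant A the extracted text is passed to to_tsvector() and then discarded, so search still works but excerpts cannot be generated later without re-extracting the PDF.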
> Actually, if you're not going to use the headline() function, you can
> just store it directly in a vector, cutting down on the size
> requirements.

What size requirements?
> > Actually, if you're not going to use the headline() function, you can
> > just store it directly in a vector, cutting down on the size
> > requirements.
>
> What size requirements?

If you store both text and tsvector, that's going to use up a lot more space than if you just store the tsvector. With a proper lexer and such, it will be *more* than twice as large, given that the tsvector will be smaller than the text.

//Magnus
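Rather than taking the ratio on faith, it can be measured on real data. The Python sketch below assumes a hypothetical, non-empty `pdf_docs` table with a `body` text column and a `body_vec` tsvector column, and a DB-API cursor; `pg_column_size()` is the server function that reports the stored size of a value in bytes.

```python
# Total stored bytes of the text column and of the vector column.
SIZE_SQL = (
    "SELECT sum(pg_column_size(body)), sum(pg_column_size(body_vec)) "
    "FROM pdf_docs"
)


def storage_ratio(text_bytes, vector_bytes):
    """How many times larger text + vector is than the vector alone."""
    if vector_bytes <= 0:
        raise ValueError("vector_bytes must be positive")
    return (text_bytes + vector_bytes) / vector_bytes


def measure(cursor):
    """Run SIZE_SQL and return the ratio for the whole (non-empty) table."""
    cursor.execute(SIZE_SQL)
    text_bytes, vector_bytes = cursor.fetchone()
    return storage_ratio(int(text_bytes), int(vector_bytes))
```

With made-up numbers, 300 bytes of text and 100 bytes of vector give `storage_ratio(300, 100) == 4.0`. Whenever the vector is smaller than the text the result is above 2, which is the "more than twice as large" case described above.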