Re: Indexing MS/Open Office and PDF documents - Mailing list pgsql-general

From	Richard Huxton
Subject	Re: Indexing MS/Open Office and PDF documents
Date	March 15, 2012 18:18:06
Msg-id	4F625C7B.7090302@archonet.com
In response to	Re: Indexing MS/Open Office and PDF documents (Jeff Davis <pgsql@j-davis.com>)
List	pgsql-general

On 15/03/12 21:12, Jeff Davis wrote:
> On Fri, 2012-03-16 at 01:57 +0530, Alexander.Bagerman@cognizant.com

>> We have
>> hard time identifying MS/Open Office and PDF parsers to index stored
>> documents and make them available for text searching.

> The first step is to find a library that can parse such documents, or
> convert them to a format that can be parsed.

I've used docx2txt and pdf2txt and friends to produce text files that I
then index during the import process. An external script runs the whole
process. All I cared about was extracting raw text, though; this does
nothing to identify headings etc.
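
As a rough illustration of that kind of import script (a sketch, not the
script referred to above), the Python below shells out to a converter and
stores the raw text with a tsvector for full-text search. The converter
command lines, the "documents" table and its columns, and the
psycopg2-style "%s" placeholders are all assumptions; adjust them to the
local install.

```python
import subprocess
from pathlib import Path

# Assumed command lines; check how the converters are installed locally.
# Each one is expected to write plain text to stdout.
CONVERTERS = {
    ".docx": ["docx2txt", "{path}", "-"],
    ".pdf": ["pdf2txt", "{path}"],
}

# Hypothetical table: documents(path text, body text, body_tsv tsvector).
# Placeholders follow the psycopg2 "%s" style (an assumption).
INSERT_SQL = (
    "INSERT INTO documents (path, body, body_tsv) "
    "VALUES (%s, %s, to_tsvector('english', %s))"
)


def converter_command(path):
    """Return the argv list that converts `path` to text, or None if unsupported."""
    template = CONVERTERS.get(Path(path).suffix.lower())
    if template is None:
        return None
    return [str(path) if arg == "{path}" else arg for arg in template]


def extract_text(path):
    """Run the external converter and return the raw text it prints."""
    cmd = converter_command(path)
    if cmd is None:
        raise ValueError(f"no converter configured for {path}")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def import_document(cursor, path):
    """Extract raw text from `path` and index it via a DB-API cursor."""
    body = extract_text(path)
    cursor.execute(INSERT_SQL, (str(path), body, body))
```

Only raw text is captured here, so headings and other structure are lost,
as noted above.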

--
   Richard Huxton
   Archonet Ltd
