Many applications that deal with unstructured data require access to the text content of formatted or marked-up documents. Organizations that archive documents often need this text to make the archives searchable and to support content aggregation, reporting, and mining. Search and retrieval applications also need to extract and tokenize text from various file formats.
One standard mechanism for extracting text from documents is the IFilter plug-in interface used by Microsoft search engines. Microsoft and other vendors have developed a few IFilter implementations that cover a variety of file formats. The standard of reliability and the quality of text extraction vary from one IFilter developer to another.
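To illustrate how a client usually consumes an IFilter, here is a minimal sketch in Python of the pull loop: the caller asks the filter for the next chunk, then reads that chunk's text piece by piece until none is left. This is only a model of the pattern, not real COM interop. `MockFilter`, `Chunk`, and `extract_text` are hypothetical names introduced for this illustration, and the `None` return values stand in for the `FILTER_E_END_OF_CHUNKS` and `FILTER_E_NO_MORE_TEXT` results that the real `GetChunk` and `GetText` methods return.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for one chunk of a document (e.g. a paragraph)."""
    chunk_id: int
    pieces: List[str]  # text handed out by successive get_text() calls


class MockFilter:
    """Hypothetical in-memory model of the IFilter pull pattern (not COM)."""

    def __init__(self, chunks: List[Chunk]) -> None:
        self._chunks = list(chunks)
        self._index = -1   # no chunk selected yet
        self._piece = 0    # next text piece within the current chunk

    def get_chunk(self) -> Optional[int]:
        """Advance to the next chunk; None models FILTER_E_END_OF_CHUNKS."""
        self._index += 1
        self._piece = 0
        if self._index >= len(self._chunks):
            return None
        return self._chunks[self._index].chunk_id

    def get_text(self) -> Optional[str]:
        """Return the next text piece; None models FILTER_E_NO_MORE_TEXT."""
        pieces = self._chunks[self._index].pieces
        if self._piece >= len(pieces):
            return None
        text = pieces[self._piece]
        self._piece += 1
        return text


def extract_text(doc_filter: MockFilter) -> str:
    """Drain every chunk of the filter and join the chunks with newlines."""
    parts: List[str] = []
    while doc_filter.get_chunk() is not None:
        buffer: List[str] = []
        while True:
            text = doc_filter.get_text()
            if text is None:
                break
            buffer.append(text)
        parts.append("".join(buffer))
    return "\n".join(parts)


if __name__ == "__main__":
    demo = MockFilter([
        Chunk(1, ["Hello, ", "world"]),
        Chunk(2, ["Second paragraph"]),
    ])
    print(extract_text(demo))
```

A real IFilter also reports property values and chunk metadata such as locale and break type. The sketch leaves these out to keep the control flow visible.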
Opait Text Filters is a small utility program that provides a simple interface to the IFilters already installed on the host computer. It also includes a few custom text extraction filters that work directly with file formats and improve upon the default IFilter implementations.
Text extraction is exposed through a small class library, Opait.Filters, which is included with the program and can be used to integrate text filters into .NET applications.
.NET Framework 4.5