Using PDF text contents for search

The tutorial Formatting CSV data and enabling search covers the basic concept of using external CSV data to display Large Tooltips in FSI Pages and gives an overview of searching this CSV data within FSI Pages.

Using CSV data to combine the images in FSI Pages with searchable text is the most flexible way to provide such data: as described in the tutorial mentioned above, additional fields can hold search keywords that do not need to appear anywhere in the visible text.

If you want to add keywords without adding hyperlinks and Large Tooltips, FSI Pages Converter can also detect and extract the text from the PDF source files and provide it to the search in FSI Pages. This tutorial describes how to configure FSI Pages Converter to extract text from the PDF source files and outlines the option of using Excel or CSV files as the source for keywords.

Search settings

In the “Text Search” section of FSI Pages Converter, check the “Enable Text Search” box and, in the “Sources” area, select the “PDF – Get Texts from PDF Documents” option.
FSI Pages Converter will then try to extract the text from the PDF files. Please note that because PDF files can be structured internally in many different ways, some limitations apply to the text extraction. Some PDF documents contain graphics rather than text, and FSI Pages Converter cannot extract keywords from such documents.

Additionally, certain fonts that are not embedded in the PDF (e.g. Identity-H) might prevent FSI Pages Converter from extracting keywords properly. In this case you might want to provide keywords by means of CSV or Excel files instead, as described below.

Text filter

Another important setting when using the text extraction capability of FSI Pages Converter is the text filter. It defines which data is written to the TIFF files’ metadata during conversion.
(Technical note: Incorrect parameters here may lead to missing data inside the TIFF files. Because the data is written to the TIFF files upon conversion, changing these parameters later requires converting the source PDF to TIFF files again.)

As the name suggests, the minimum and maximum word width define the range of word lengths, in characters, that will be extracted. If you set the parameters as shown (a minimum of 3 and a maximum of 32 characters), searching for “as” or “someratherlongwordexceeding32chars” will not yield any results, because these words are shorter than 3 or longer than 32 characters.
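
The effect of the word width limits can be illustrated with a short Python sketch. This is not FSI Pages Converter code; the function name and the assumption that both limits are inclusive are hypothetical and only derived from the example above.

```python
# Hypothetical illustration of a word length filter; not FSI Pages Converter code.
MIN_WORD_LENGTH = 3   # minimum from the example above (assumed inclusive)
MAX_WORD_LENGTH = 32  # maximum from the example above (assumed inclusive)


def passes_length_filter(word, min_len=MIN_WORD_LENGTH, max_len=MAX_WORD_LENGTH):
    """Return True if the word length lies within the inclusive limits."""
    return min_len <= len(word) <= max_len


words = ["as", "catalog", "someratherlongwordexceeding32chars"]
kept = [w for w in words if passes_length_filter(w)]
print(kept)  # only "catalog" remains
```

Words dropped by such a filter never reach the TIFF metadata, which is why they cannot be found later without converting again.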

In the “Stop Word List” option, please choose the language of the text in the PDF files. The stop word list contains very common words of the chosen language, e.g. “and” or “some”. Searching for such words is not meaningful, so they are not added as keywords. You can edit the existing stop word lists or add new ones by modifying or adding files in the “stopwords” directory of your FSI Pages Converter setup.
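
As a rough sketch of what a stop word list does, the following Python snippet removes stop words from a list of extracted words. It assumes a plain-text list with one word per line and a case-insensitive comparison; the actual file format in the “stopwords” directory and the matching rules of FSI Pages Converter may differ, and all names are hypothetical.

```python
# Hypothetical illustration; not FSI Pages Converter code.
# Assumption: a stop word list is plain text with one word per line.
SAMPLE_STOP_WORD_LIST = """and
some
the
"""


def load_stop_words(text):
    """Parse a one-word-per-line list into a lowercase set."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stop_words(words, stop_words):
    """Keep only words that are not in the stop word set (case-insensitive)."""
    return [w for w in words if w.lower() not in stop_words]


stop_words = load_stop_words(SAMPLE_STOP_WORD_LIST)
keywords = remove_stop_words(["Some", "chairs", "and", "tables"], stop_words)
print(keywords)  # "Some" and "and" are dropped
```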

Finally, the “Valid Char Blocks” option defines which characters are valid in the text being extracted and searched. For example, if you render a catalog that contains Greek characters, you need to choose “Greek” here in order for the Greek text to be available when searching.
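
The idea behind valid character blocks can be sketched in Python as a check against Unicode code point ranges. The block table, the function name and the choice to reject a whole word when it contains a character outside the enabled blocks are assumptions for illustration; how FSI Pages Converter handles such characters internally is not documented here.

```python
# Hypothetical illustration; not FSI Pages Converter code.
# Code point ranges taken from the Unicode block definitions.
UNICODE_BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Greek": (0x0370, 0x03FF),  # Unicode block "Greek and Coptic"
}


def is_valid_word(word, enabled_blocks):
    """Return True if every character falls into one of the enabled blocks."""
    ranges = [UNICODE_BLOCKS[name] for name in enabled_blocks]
    return all(any(lo <= ord(ch) <= hi for lo, hi in ranges) for ch in word)


# Greek capital letters spelling "KATALOGOS", written as escapes for clarity.
greek_word = "\u039a\u0391\u03a4\u0391\u039b\u039f\u0393\u039f\u03a3"
words = ["catalog", greek_word]

latin_only = [w for w in words if is_valid_word(w, ["Basic Latin"])]
latin_and_greek = [w for w in words if is_valid_word(w, ["Basic Latin", "Greek"])]
print(len(latin_only), len(latin_and_greek))
```

With only the Latin block enabled, the Greek word is filtered out; enabling “Greek” as well keeps both words.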

Using CSV Data

If the text extraction does not yield sufficient keywords, or if you want to add keywords that are not part of a page’s text, you can use CSV or Excel files to provide search data. This requires a simple CSV or Excel file that contains at least two columns: Pagenumber and Text. Other fields (such as an additional “Search” column) are optional.
In the “External Data” tab of FSI Pages Converter, select the appropriate file and set the search as follows. This defines the “Text” and “Search” columns as searchable data, while the column selected in “Page Index Field” specifies the number of the page that the data belongs to.

Converting PDF documents with a configuration like this will make FSI Pages Converter add the keywords from the Excel or CSV file rather than from the text contained in the PDF document.
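
To make the relationship between the page index column and the searchable columns more concrete, here is a minimal Python sketch that reads such a file and maps each keyword to the pages it belongs to. The sample rows, the comma delimiter and all function and variable names are assumptions for illustration; only the column names Pagenumber, Text and Search come from the description above, and the code does not reflect how FSI Pages Converter processes the file internally.

```python
# Hypothetical illustration; not FSI Pages Converter code.
import csv
import io

# Assumed sample data: comma-separated, one row per page.
SAMPLE_CSV = """Pagenumber,Text,Search
1,Summer catalog cover,outdoor garden
2,Folding chair,seat furniture
"""

PAGE_INDEX_FIELD = "Pagenumber"    # column holding the page number
SEARCH_FIELDS = ["Text", "Search"]  # columns treated as searchable data


def build_search_index(csv_text):
    """Map each lowercase keyword to the set of page numbers containing it."""
    index = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        page = int(row[PAGE_INDEX_FIELD])
        for field in SEARCH_FIELDS:
            # A missing or empty optional column contributes no keywords.
            for word in (row.get(field) or "").lower().split():
                index.setdefault(word, set()).add(page)
    return index


index = build_search_index(SAMPLE_CSV)
print(sorted(index["furniture"]))  # pages where "furniture" can be found
```

Note that “furniture” appears only in the optional “Search” column, so it becomes searchable for page 2 without being part of the visible text.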

Further information on using CSV data with FSI Pages Converter is available in the tutorial Formatting CSV data and enabling search.

Publishing

When publishing an FSI Pages instance, you need to enable searching in the “FSI Pages” tab in the “Publish as FSI Pages” section of the imaging server web interface. This enables users to search for any word in your catalog that meets the text filter criteria and to see the search result as shown: