File indexing and full-text searching

If you are running FileRun using Docker, please use this guide instead.

Introduction

Searching files by keywords in their contents requires additional configuration and third-party software.

The feature is enabled from Control Panel » System configuration » Files » Searching.

Please follow the next steps.

Install Apache Tika (Command line mode)

Without Apache Tika, FileRun will index only plain-text files for searching.

Apache Tika is used for extracting text contents from non-plain-text files, such as PDFs or office files.

Note: Your server needs to have Java installed in order to run Apache Tika.

You can read more about Apache Tika here: https://tika.apache.org

Running Tika in command line mode:

  1. Download the tika-app-[*].jar file (note the "app" part in the file name) from here: https://tika.apache.org/download.html
  2. Set the path to the tika-app-[*].jar file in FileRun's control panel.

That's it!

Click the Test button to make sure it works. If Java is installed on the server and the file path is correct, you should see the Apache Tika version displayed as a result of the test.
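
The same check can also be run from the server's command line, which helps tell a missing Java runtime apart from a wrong file path. The following is only a sketch: the function name check_tika_app is made up for illustration and is not part of FileRun, and it assumes the tika-app jar accepts the --version option.

```shell
#!/bin/sh
# check_tika_app is a hypothetical helper, not part of FileRun.
# Usage: check_tika_app /path/to/tika-app-[*].jar
check_tika_app() {
    jar="$1"
    # Java must be on the PATH of the user running the command.
    if ! command -v java >/dev/null 2>&1; then
        echo "java not found in PATH" >&2
        return 1
    fi
    # The path must point to an existing, readable file.
    if [ ! -r "$jar" ]; then
        echo "cannot read jar: $jar" >&2
        return 1
    fi
    # Prints the Apache Tika version when everything is in order.
    java -jar "$jar" --version
}
```

Run it as the same system user that runs PHP, since that is the user that needs access to both Java and the jar file.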

Install Apache Tika (Server mode)

Running Tika in server mode usually speeds up the indexing process.

  1. Download the tika-server-[*].jar file (note the "server" part in the file name) from here: https://tika.apache.org/download.html
  2. Start up the server: java -jar tika-server-[*].jar
  3. Set the hostname and port number (default 9998) of the Tika server

Click the Test server button to make sure it works. If everything is in order, you should see the Apache Tika version displayed as a result of the test.

You can also run Apache Tika in server mode using Docker: https://github.com/LogicalSpark/docker-tikaserver
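
Whichever way the server is started, you can confirm from the command line that it is reachable before entering the hostname and port in FileRun. This is a minimal sketch, assuming curl is installed and that the Tika server exposes its version at the /version path. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, not part of FileRun or Apache Tika.

# Builds the URL of the Tika server's version endpoint.
tika_version_url() {
    host="${1:-localhost}"
    port="${2:-9998}"   # 9998 is the default Tika server port
    printf 'http://%s:%s/version\n' "$host" "$port"
}

# Prints the version string, or fails if the server cannot be reached.
# Usage: tika_server_version [host] [port]
tika_server_version() {
    curl --silent --fail --max-time 5 "$(tika_version_url "$@")"
}
```

Calling tika_server_version without arguments checks localhost on port 9998. A non-zero exit status means the server did not answer within five seconds.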

OCR

Enabling OCR will make the indexing process slow. There is no way around it. If you do not need to index image files or scanned PDF documents, do not enable this.

To get text out of image files, Apache Tika can make automatic use of Tesseract.

Read more info about it here: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR

Note: On Windows servers, if Apache Tika doesn't detect Tesseract, make sure you add the path of tesseract.exe to the PATH system variable.

OCR Scanned PDFs

By default, Apache Tika only looks for text contents in PDF documents.

Scanned PDF documents don't usually contain text, but photos of text.

Apache Tika needs to be told to read both text and OCR images, and that is done through an XML config file.

Copy the following into a text file and save it as tika-config.xml:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser"/>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="ocrStrategy" type="string">ocr_and_text</param>
            </params>
        </parser>
    </parsers>
</properties>

Set the full path of the saved tika-config.xml file in the appropriate field in the FileRun control panel.
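
To try the config file outside of FileRun, you can pass it to the Tika app directly and look at the extracted text. The sketch below assumes the command line mode jar (tika-app) and its --config and --text options. The function name ocr_extract and the file names in the usage line are placeholders.

```shell
#!/bin/sh
# ocr_extract is a hypothetical helper, not part of FileRun.
# Usage: ocr_extract /path/to/tika-app-[*].jar /path/to/tika-config.xml scanned.pdf
ocr_extract() {
    jar="$1"; config="$2"; file="$3"
    # All three paths must point to readable files.
    for f in "$jar" "$config" "$file"; do
        if [ ! -r "$f" ]; then
            echo "cannot read: $f" >&2
            return 1
        fi
    done
    # Tika can only run OCR when it finds Tesseract.
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "warning: tesseract not found in PATH" >&2
    fi
    # --config loads the XML file, --text prints the extracted plain text.
    java -jar "$jar" --config="$config" --text "$file"
}
```

If a scanned PDF produces text with the config file and none without it, the ocrStrategy setting is being picked up.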

Testing Apache Tika with documents

To see if Apache Tika is able to properly extract text contents from files, you can enable the Admin: Text Indexer Test plugin from Control Panel » System configuration » Files » Plugins.

Once you enable the plugin, you might need to clear your browser's cache in order to see it.

Here's how to use it:

  • Reload the FileRun user interface.
  • Browse to an existing file and right-click it.
  • Choose Open with.. » Admin: Text Indexer Test
  • A popup window will open.
  • Inside the popup, FileRun will use Apache Tika to try to extract the text contents.
  • The extracted text contents will be displayed in the popup window.
  • If there is no text, FileRun will not index anything.

Install Elasticsearch

Note that FileRun has so far been tested only with Elasticsearch version 6 and might not work properly with newer versions.

You can download and read more about Elasticsearch here: https://www.elastic.co/downloads/elasticsearch

Once you have an instance of Elasticsearch running, configure it inside FileRun:

You only need to set the URL of the host. If the server is password protected, include the credentials inside the URL:

http://username:password@your-elastic-server.com

Click the Test server button to make sure FileRun can connect to it. If everything is in order, you should see the Elasticsearch cluster name and list of nodes.

The Test server step is not optional, as FileRun uses it to create the index if it doesn't already exist.
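
Because only version 6 has been tested, it can be worth checking which version your server reports before pointing FileRun at it. The following sketch assumes curl is installed and that the server's root URL returns the usual JSON document containing a version number. Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, not part of FileRun.

# Reads the JSON returned by the Elasticsearch root URL on standard input
# and prints the major version number.
es_major_version() {
    grep -o '"number"[[:space:]]*:[[:space:]]*"[0-9][0-9.]*"' |
        head -n 1 |
        sed 's/.*"\([0-9]*\)\..*/\1/'
}

# Usage: es_check http://username:password@your-elastic-server.com
es_check() {
    major="$(curl --silent --fail --max-time 5 "$1" | es_major_version)"
    if [ "$major" = "6" ]; then
        echo "Elasticsearch $major.x: the version FileRun was tested with"
    else
        echo "Elasticsearch major version '${major:-unknown}': untested with FileRun" >&2
        return 1
    fi
}
```

This only checks connectivity and the version. It does not create the index, so the Test server button in the control panel is still required.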

Testing indexing

Save the changes made to the settings in FileRun's control panel. From this point forward, FileRun will queue the files you upload or create for indexing.

Please note that old/existing files will not be automatically indexed. If you wish to index these files, please see the available utility command lines.

To test the configuration, run the following from the FileRun server's command line:

cd /path/to/filerun/cron
php process_search_index_queue.php

The script shows its progress as it processes the search indexing queue. It extracts the file contents using Apache Tika and sends them to Elasticsearch for indexing.

If you are getting PHP errors, you might need to specify the path of your PHP configuration file:

php -c /path/to/php.ini process_search_index_queue.php

To find out the path of the “php.ini” used by FileRun, create a file named info.php in the FileRun folder, type <?php phpinfo(); inside, and open http://your-site.com/filerun/info.php in your browser.
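
The command-line PHP binary often loads a different configuration file than the web server does, which is the usual reason the -c option is needed. As a rough sketch, the helper below prints the file the command-line binary loads by default, so you can compare it with the path shown by phpinfo(). The function name cli_php_ini is an assumption for illustration.

```shell
#!/bin/sh
# cli_php_ini is a hypothetical helper, not part of FileRun.
# Prints the php.ini loaded by the command-line PHP binary.
cli_php_ini() {
    if ! command -v php >/dev/null 2>&1; then
        echo "php not found in PATH" >&2
        return 1
    fi
    # php_ini_loaded_file() returns false when no php.ini was loaded.
    php -r 'echo php_ini_loaded_file() ?: "(none)", PHP_EOL;'
}
```

If the two paths differ, pass the web server's one to the script with the -c option.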

If you are getting this error from the Elasticsearch server: [FORBIDDEN/12/index read-only / allow delete (api)], run this to switch off the read-only flag on the index:

curl -X PUT 'http://127.0.0.1:9200/files/_settings' --data '{"index": {"blocks": {"read_only_allow_delete": "false"}}}' --header "Content-Type: application/json"
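
Elasticsearch usually sets this flag by itself when the disk holding its data is almost full, so the error may come back unless you free up space first. To see whether the flag is currently set, you can read the index settings back. This is a sketch with hypothetical function names, assuming curl is installed and the index is named files, as in the command above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of FileRun.

# Builds the settings URL for an index ("files" by default).
es_settings_url() {
    base="${1:-http://127.0.0.1:9200}"
    index="${2:-files}"
    # ${base%/} drops a trailing slash so the URL has no double slash.
    printf '%s/%s/_settings\n' "${base%/}" "$index"
}

# Prints the index settings; look for "read_only_allow_delete" in the output.
# Usage: es_show_settings [base-url] [index]
es_show_settings() {
    curl --silent --fail --max-time 5 "$(es_settings_url "$@")?pretty"
}
```

Run es_show_settings before and after the curl command above to confirm the flag has been cleared.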

Automate the indexing task

As extracting the text from a binary file requires a lot of CPU processing, the files are queued and processed one at a time. This requires the script “cron/process_search_index_queue.php” to be executed frequently. We recommend running the script every 5 minutes or so. This way you will not have to wait too long before an uploaded file can be found by the search engine.

On a Linux server this can easily be done by setting up a cron job like this:

  1. Create a new text file at “cron/process_search_index_queue.sh”, make it executable (chmod +x) and write the following inside:
    #!/bin/sh
    cd /path-to-filerun/cron && php -c /path/to/php.ini process_search_index_queue.php
  2. Open a command line console (SSH)
  3. Open the crontab editor by running:
    crontab -e
  4. Write:
    */5 * * * * /path-to-filerun/cron/process_search_index_queue.sh
  5. If the editor is vi, type :wq and press Enter to save the changes and close the editor.

It is recommended that you temporarily configure cron with an e-mail address to receive the output of the command and make sure it works properly.

If your hosting service is running the cPanel administrative tool, it usually provides a web-based tool that makes setting up cron jobs easier.

On Windows this can be achieved by creating a scheduled task in the Windows Task Scheduler which calls a .BAT file containing something like this:

CD cron
C:/PHP/PHP.EXE process_search_index_queue.php