File indexing and full-text searching
If you are running FileRun using Docker, please use this guide instead.
Introduction
Searching files by keywords in their contents requires additional configuration and third-party software.
The feature is enabled from Control Panel » Files » Searching.
Please follow the next steps.
Install Apache Tika (Command line mode)
Without Apache Tika, FileRun will index only plain-text files for searching.
Apache Tika is used for extracting text contents from non-plain-text files, such as PDFs or office files.
Note: Your server needs to have Java installed in order to run Apache Tika.
You can read more about Apache Tika here: https://tika.apache.org
Running Tika in command line mode:
- Download the tika-app-[*].jar file (note the "app" part in the file's name) from here: https://tika.apache.org/download.html
- Set the path to the tika-app-[*].jar file inside FileRun's control panel
That's it!
Click the Test button to make sure it works. If Java is installed on the server and the file path is correct, you should see the Apache Tika version displayed as a result of the test.
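The same check can be run by hand from a shell, which helps when the Test button reports a failure. The following is a minimal sketch and not part of FileRun: the function name check_tika_app is a placeholder, and it assumes the tika-app jar accepts the --version option to print its version.

```shell
#!/bin/sh
# Sketch: verify that Java is available and that the tika-app jar runs.
# check_tika_app is a hypothetical helper name; pass the path to your jar.
check_tika_app() {
    jar="$1"
    if ! command -v java >/dev/null 2>&1; then
        echo "java not found in PATH" >&2
        return 1
    fi
    if [ ! -f "$jar" ]; then
        echo "jar not found: $jar" >&2
        return 1
    fi
    # Prints the Apache Tika version on success.
    java -jar "$jar" --version
}
```

Call it as check_tika_app /path/to/tika-app.jar, substituting the name of the file you actually downloaded. A non-zero exit status points at either a missing Java installation or a wrong path.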
Install Apache Tika (Server mode)
Running Tika in server mode usually speeds up the indexing process.
- Download the tika-server-[*].jar file (note the "server" part in the file's name) from here: https://tika.apache.org/download.html
- Start up the server:
java -jar tika-server-[*].jar
- Set the hostname and port number (default 9998) of the Tika server in FileRun's control panel
Click the Test server button to make sure it works. If everything is in order you should see the Apache Tika version displayed as a result of the test.
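To confirm the server is reachable independently of FileRun, you can query it with curl. This is a rough sketch: the helper names are made up for illustration, and it assumes the Tika server exposes its version at the /version path and that curl is installed.

```shell
#!/bin/sh
# Sketch: build the Tika server's version URL and query it with curl.
# tika_version_url and check_tika_server are hypothetical helper names.
tika_version_url() {
    host="${1:-localhost}"
    port="${2:-9998}"   # 9998 is the default port mentioned above
    printf 'http://%s:%s/version\n' "$host" "$port"
}

check_tika_server() {
    # --fail makes curl return non-zero on HTTP errors;
    # --max-time avoids hanging if the server is unreachable.
    curl --silent --fail --max-time 10 "$(tika_version_url "$@")"
}
```

Running check_tika_server with no arguments targets localhost on port 9998; pass a hostname and port as the two arguments if the server runs elsewhere.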
You can also run Apache Tika in server mode using Docker: https://github.com/LogicalSpark/docker-tikaserver
OCR
Enabling OCR will make the indexing process slow. There is no way around it. If you do not need to index image files or scanned PDF documents, do not enable this.
To get text out of image files, Apache Tika can make automatic use of Tesseract.
Read more info about it here: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
Note: On Windows servers, if Apache Tika doesn't detect Tesseract, make sure you add the path of tesseract.exe to the PATH system variable.
OCR Scanned PDFs
By default, Apache Tika only looks for text contents in PDF documents.
Scanned PDF documents don't usually contain text, but photos of text.
Apache Tika needs to be told to read both text and OCR images, and that is done through an XML config file.
Copy the following into a text file and save it as tika-config.xml:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>
  </parsers>
</properties>
Set the full path of the saved tika-config.xml file in the appropriate field in the FileRun control panel.
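Before relying on the config inside FileRun, it can be worth trying it on a single scanned PDF from the command line. The sketch below is only illustrative: both function names are hypothetical, the file paths are yours to supply, and it assumes the tika-app jar (command line mode) with its --config and --text options.

```shell
#!/bin/sh
# Sketch: sanity-check tika-config.xml and try it on one scanned PDF.
# Both function names are hypothetical helpers, not FileRun or Tika tools.

# Succeeds if the file contains, on a single line, the ocrStrategy
# param set to ocr_and_text.
has_ocr_strategy() {
    grep -q 'name="ocrStrategy"[^>]*>ocr_and_text<' "$1"
}

# Extracts plain text from one file using the given jar and config.
tika_extract_text() {
    jar="$1"; config="$2"; file="$3"
    for f in "$jar" "$config" "$file"; do
        [ -f "$f" ] || { echo "missing file: $f" >&2; return 1; }
    done
    java -jar "$jar" --config="$config" --text "$file"
}
```

For example, tika_extract_text /path/to/tika-app.jar /path/to/tika-config.xml scanned.pdf should print the recognized text if Tesseract is installed and detected. Empty output for a scanned document usually means OCR did not run.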
Testing Apache Tika with documents
To see if Apache Tika is able to properly extract text contents from files, right-click a file and select the Control panel option under More options. There you will find troubleshooting tools that show whether text extraction works for the file and what the extracted result is.
Install Elasticsearch
You can download and read more about Elasticsearch here: https://www.elastic.co/downloads/elasticsearch
Once you have an instance of Elasticsearch running, configure it inside FileRun:
You only need to set the URL of the host. If the server is password protected, include the credentials inside the URL:
http://username:password@your-elastic-server.com
Click the Test server button to make sure FileRun can connect to it. If everything is in order you should see the Elasticsearch cluster name and list of nodes.
The Test server step is not optional, as FileRun uses it to create the index if it doesn't already exist.
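If the test fails, checking the connection with curl from the FileRun server can show whether the problem is the URL, the credentials, or the network. The following is a sketch under assumptions: the helper names are invented for illustration, and it relies on the Elasticsearch root endpoint returning JSON that includes the cluster name.

```shell
#!/bin/sh
# Sketch: assemble the Elasticsearch URL with optional credentials.
# es_url and check_elasticsearch are hypothetical helper names.
es_url() {
    host="$1"; user="${2:-}"; pass="${3:-}"
    if [ -n "$user" ]; then
        printf 'http://%s:%s@%s\n' "$user" "$pass" "$host"
    else
        printf 'http://%s\n' "$host"
    fi
}

check_elasticsearch() {
    # The root endpoint returns JSON that includes "cluster_name".
    curl --silent --fail --max-time 10 "$(es_url "$@")/"
}
```

For instance, check_elasticsearch your-elastic-server.com username password. Note that characters such as @, : or / in the password must be percent-encoded before being placed in a URL; this sketch does not do that for you.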
Testing indexing
Save the changes made to the settings in FileRun's control panel. From this point forward, FileRun will queue the files you upload or create for indexation.
Please note that old/existing files will not be automatically indexed. If you wish to index these files, please see the available utility command lines.
To test the configuration, run the following from the FileRun server's command line:
cd /path/to/filerun/cron
php process_search_index_queue.php
It will show the progress of processing the search indexing queue. It will extract file contents using Apache Tika and send it to Elasticsearch for indexing.
If you are getting PHP errors, you might need to specify the path of your PHP configuration file:
php -c /path/to/php.ini process_search_index_queue.php
To find out the path of the “php.ini” used by FileRun, create a file named info.php inside the FileRun folder, type <?php phpinfo(); inside, and open http://your-site.com/filerun/info.php in your browser.
If you are getting this error from the Elasticsearch server: [FORBIDDEN/12/index read-only / allow delete (api)], run this to switch off the read-only flag on the index:
curl -X PUT 'http://127.0.0.1:9200/files/_settings' --data '{"index": {"blocks": {"read_only_allow_delete": "false"}}}' --header "Content-Type: application/json"
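Elasticsearch typically sets this flag itself when a node runs low on disk space, so the block may come back unless space is freed. The sketch below shows one way to look at the current index settings and per-node disk usage; the ES_URL variable and the function names are assumptions for illustration, while the address and the files index name are taken from the command above.

```shell
#!/bin/sh
# Sketch: inspect the read-only block and disk usage before clearing the flag.
# ES_URL and the function names are assumptions for illustration.
ES_URL="${ES_URL:-http://127.0.0.1:9200}"

index_settings_url() {
    # $1 is the index name; "files" is the index used in the command above.
    printf '%s/%s/_settings\n' "$ES_URL" "${1:-files}"
}

show_index_blocks() {
    curl --silent --fail --max-time 10 "$(index_settings_url "${1:-files}")?pretty"
}

show_disk_allocation() {
    # _cat/allocation lists disk used and available per node.
    curl --silent --fail --max-time 10 "$ES_URL/_cat/allocation?v"
}
```

Run show_disk_allocation first; if a node is nearly full, free up space before switching off the flag.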
Automate the indexing task
As extracting the text from a binary file requires a lot of CPU processing, the files are queued and processed one at a time. This requires the script “cron/process_search_index_queue.php” to be executed frequently. We recommend running the script every 5 minutes or so. This way you will not have to wait too long before an uploaded file can be found by the search engine.
On a Linux server this can easily be done by setting up a cron job like this:
- Create a new text file at “cron/process_search_index_queue.sh” and write the following inside (the cd is needed because cron does not start in FileRun's cron folder):
cd /path-to-filerun/cron
php -c /path/to/php.ini process_search_index_queue.php
- Open a command line console (SSH)
- Open the crontab editor by running:
crontab -e
- Write the following line, which runs the script every 5 minutes:
*/5 * * * * /path-to-filerun/cron/process_search_index_queue.sh
- Type :wq and press Enter to save the changes and close the editor.
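Two details commonly trip up this setup: cron does not run the script from FileRun's cron folder, and the .sh file must be executable. The following sketch creates the wrapper with both handled and prints a matching crontab entry. The function name install_index_wrapper is hypothetical, and /path/to/php.ini is the same placeholder used above.

```shell
#!/bin/sh
# Sketch: create the wrapper script and print a matching crontab entry.
# install_index_wrapper is a hypothetical helper; /path/to/php.ini is the
# placeholder used above and must be replaced with the real path.
install_index_wrapper() {
    cron_dir="$1"
    wrapper="$cron_dir/process_search_index_queue.sh"
    cat > "$wrapper" <<'EOF'
#!/bin/sh
# Cron does not start in this folder, so change into it first.
cd "$(dirname "$0")" || exit 1
php -c /path/to/php.ini process_search_index_queue.php
EOF
    # Cron can only run the file directly if it is executable.
    chmod +x "$wrapper"
    # Every 5 minutes, as recommended above.
    printf '*/5 * * * * %s\n' "$wrapper"
}
```

Calling install_index_wrapper /path-to-filerun/cron writes the script and prints the line to paste into the crontab editor.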
It is recommended that you temporarily configure cron with an e-mail address to receive the results of the command and make sure it works properly.
If your hosting service uses the cPanel administrative tool, it usually provides a web-based tool that makes setting up cron jobs easier.
On Windows this can be achieved by creating a scheduled task which calls a .BAT file containing something like this:
CD cron
C:/PHP/PHP.EXE process_search_index_queue.php