If you are running FileRun using Docker, please use this guide instead.
Searching files by keywords in their contents requires additional configuration and third-party software.
The feature is enabled from Control Panel → Files → Searching.
Please follow the next steps.
Without Apache Tika, FileRun will index only plain-text files for searching.
Apache Tika is used for extracting text contents from non-plain-text files, such as PDFs or office files.
Note: Your server needs to have Java installed in order to run Apache Tika.
You can read more about Apache Tika here: https://tika.apache.org
Running Tika in command line mode:
- Download the tika-app-[*].jar (note the app part in the file's name) file from here: https://tika.apache.org/download.html
- Set the path to the tika-app-[*].jar file inside FileRun's control panel
That's it!
Click the Test button to make sure it works. If Java is installed on the server and the file path is correct, you should see the Apache Tika version displayed as a result of the test.
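The same check can be approximated from a shell, outside FileRun. This is a minimal sketch, assuming the jar was saved as /opt/tika/tika-app.jar (a hypothetical path, as is the TIKA_APP_JAR variable name) and relying on the --version option of the Tika app:

```shell
#!/bin/sh
# Hypothetical location of the downloaded jar; adjust to your own path.
TIKA_APP_JAR="${TIKA_APP_JAR:-/opt/tika/tika-app.jar}"

# Report what is missing instead of failing with a Java error.
tika_app_check() {
    if [ ! -f "$1" ]; then
        echo "jar not found: $1"
        return 1
    fi
    if ! command -v java >/dev/null 2>&1; then
        echo "java not found on PATH"
        return 1
    fi
    # --version prints the Apache Tika version number.
    java -jar "$1" --version
}

tika_app_check "$TIKA_APP_JAR" || true
```

If the check passes, running the jar with --text followed by the path of a document should print the text that Tika extracts from it.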
Running Tika in server mode usually speeds up the indexing process.
- Download the tika-server-[*].jar (note the server part in the file's name) file from here: https://tika.apache.org/download.html
- Start up the server: java -jar tika-server-[*].jar
Click the Test server button to make sure it works. If everything is in order you should see the Apache Tika version displayed as a result of the test.
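To confirm from the command line that the server is answering, its /version endpoint can be queried with curl. This is a sketch under the assumption that the server runs on the same machine and on Tika's default port 9998; the TIKA_SERVER_URL variable and the tika_endpoint helper are names made up for this example:

```shell
#!/bin/sh
# Assumption: the server was started locally on Tika's default port, 9998.
TIKA_SERVER_URL="${TIKA_SERVER_URL:-http://localhost:9998}"

# Join a base URL and an endpoint name, tolerating one trailing slash.
tika_endpoint() {
    printf '%s/%s\n' "${1%/}" "$2"
}

# /version returns the Apache Tika version as plain text.
if command -v curl >/dev/null 2>&1; then
    curl -fsS --max-time 5 "$(tika_endpoint "$TIKA_SERVER_URL" version)" \
        || echo "Tika server not reachable at $TIKA_SERVER_URL"
fi
```

To look at the text extracted from a specific file, the file can be uploaded to the /tika endpoint, for example with curl -T /path/to/document.pdf -H "Accept: text/plain" followed by the endpoint URL.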
You can also run Apache Tika in server mode using Docker: https://github.com/LogicalSpark/docker-tikaserver
Enabling OCR will make the indexing process slow. There is no way around it. If you do not need to index image files or scanned PDF documents, do not enable this.
To get text out of image files, Apache Tika can make automatic use of Tesseract.
Read more info about it here: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
Note: On Windows servers, if Apache Tika doesn't detect Tesseract, make sure you add the path of tesseract.exe to the PATH system variable.
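On Linux or macOS servers, a comparable check is to see whether the programs resolve on the PATH of the user that runs Apache Tika. The helper name on_path below is hypothetical, and the sketch only tests that the executables can be found, not that OCR works:

```shell
#!/bin/sh
# Report whether a program can be found on the PATH of the current user.
on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

on_path java || true
on_path tesseract || true
```

On Windows, running where tesseract in a command prompt gives similar information.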
By default, Apache Tika only looks for text contents in PDF documents.
Scanned PDF documents don't usually contain text, but photos of text.
Apache Tika needs to be told to read both text and OCR images, and that is done through an XML config file.
Copy the following into a text file and save it as tika-config.xml:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>
  </parsers>
</properties>
Set the full path of the saved tika-config.xml file in the appropriate field in the FileRun control panel.
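On a server without a desktop editor, the file can also be created from the shell. The sketch below writes the same configuration with a here-document and prints the resulting full path. CONFIG_DIR and TIKA_CONFIG are hypothetical names, and a temporary directory is used by default so the example is safe to run; in practice you would probably point CONFIG_DIR at a permanent location.

```shell
#!/bin/sh
# Hypothetical target directory; a temporary one keeps the sketch safe to run.
CONFIG_DIR="${CONFIG_DIR:-$(mktemp -d)}"
TIKA_CONFIG="$CONFIG_DIR/tika-config.xml"

# The quoted 'EOF' stops the shell from expanding anything inside the XML.
cat > "$TIKA_CONFIG" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>
  </parsers>
</properties>
EOF

# Print the full path, which is the value to enter in FileRun's control panel.
echo "$TIKA_CONFIG"
```

For a manual test outside FileRun, the Tika app should accept the file through its --config=<file> option, placed before --text and the path of a scanned PDF.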
To see if Apache Tika is able to properly extract text contents from files, right-click a file and select the Control panel option under More options. There you will find troubleshooting tools that show whether text extraction works for the file and what the result is.
You can download and read more about Elasticsearch here: https://www.elastic.co/downloads/elasticsearch
Once you have an instance of Elasticsearch running, configure it inside FileRun:
You only need to set the URL of the host. If the server is password protected, include the credentials inside the URL: http://username:password@your-elastic-server.com
Click the Test server button to make sure FileRun can connect to it. If everything is in order you should see the Elasticsearch cluster name and list of nodes.
The Test server step is not optional, as FileRun uses it to create the index if it doesn't already exist.
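If the test fails, it can help to query Elasticsearch directly from the FileRun server. This is a minimal sketch, assuming Elasticsearch listens on 127.0.0.1:9200 (its default HTTP port); es_url, ES_HOST, ES_USER and ES_PASS are hypothetical names used only here. It does not replace the Test server button, since it does not create the index.

```shell
#!/bin/sh
# Build the host URL, embedding credentials only when a user name is given.
# Arguments: host[:port], optional user name, optional password.
es_url() {
    if [ -n "${2:-}" ]; then
        printf 'http://%s:%s@%s\n' "$2" "${3:-}" "$1"
    else
        printf 'http://%s\n' "$1"
    fi
}

# Assumption: Elasticsearch listens locally on its default HTTP port, 9200.
ES_HOST="${ES_HOST:-127.0.0.1:9200}"
ES_URL="$(es_url "$ES_HOST" "${ES_USER:-}" "${ES_PASS:-}")"

if command -v curl >/dev/null 2>&1; then
    # The root endpoint reports the cluster name; _cat/nodes lists the nodes.
    curl -fsS --max-time 5 "$ES_URL/" \
        && curl -fsS --max-time 5 "$ES_URL/_cat/nodes?v" \
        || echo "Elasticsearch not reachable at $ES_HOST"
fi
```

Credentials that contain characters such as @ or : need to be percent-encoded before they are placed in the URL.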
Save the changes made to the settings in FileRun's control panel. From this point forward, FileRun will queue the files you upload or create for indexing.
Please note that old/existing files will not be automatically indexed. If you wish to index these files, please see the available utility command lines.
To test the configuration, run the following from the FileRun server's command line:
cd /path/to/filerun/cron
php process_search_index_queue.php
It will show the progress of processing the search indexing queue: file contents are extracted using Apache Tika and sent to Elasticsearch for indexing.
If you are getting PHP errors, you might need to specify the path of your PHP configuration file:
php -c /path/to/php.ini process_search_index_queue.php
To find out the path of the "php.ini" used by FileRun, create a file named info.php in the FileRun folder, type <?php phpinfo(); inside and open http://your-site.com/filerun/info.php in your browser.
If you are getting this error from the Elasticsearch server: [FORBIDDEN/12/index read-only / allow delete (api)], run this to switch off the read-only flag on the index:
curl -X PUT 'http://127.0.0.1:9200/files/_settings' --data '{"index": {"blocks": {"read_only_allow_delete": "false"}}}' --header "Content-Type: application/json"
As extracting the text from a binary file requires a lot of CPU processing, the files are queued and processed one at a time. This requires the script "cron/process_search_index_queue.php" to be executed frequently. We recommend running the script every 5 minutes or so. This way you will not have to wait too long before an uploaded file can be found by the search engine.
On a Linux server this can easily be done by setting up a cron job. If the crontab opens in the vi editor, type :wq and press Enter to save the changes and close the editor.
It is recommended that you temporarily configure cron with an e-mail address to receive the results of the command and make sure it works properly.
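A cron entry for the 5-minute schedule could look like the output of the sketch below. The PHP binary path, the FileRun path and the e-mail address are assumptions to be replaced with your own values, and cron_entry is a hypothetical helper that only prints the lines without installing anything. MAILTO is the standard cron variable for mailing a job's output.

```shell
#!/bin/sh
# Assumed locations and address; replace all three with your own values.
FILERUN_DIR="${FILERUN_DIR:-/path/to/filerun}"
PHP_BIN="${PHP_BIN:-/usr/bin/php}"
MAILTO_ADDR="${MAILTO_ADDR:-you@example.com}"

# Print a MAILTO line and a crontab line that runs the queue processor
# every 5 minutes from inside FileRun's cron folder.
cron_entry() {
    printf 'MAILTO=%s\n' "$3"
    printf '*/5 * * * * cd %s/cron && %s process_search_index_queue.php\n' "$1" "$2"
}

cron_entry "$FILERUN_DIR" "$PHP_BIN" "$MAILTO_ADDR"
```

The printed lines can be pasted into the editor opened by crontab -e. Once the job is confirmed to work, the MAILTO line can be removed or set to an empty value to stop the e-mails.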
If your hosting service is running the cPanel administrative tool, it usually provides a web-based tool that makes setting up cron jobs easier.
On Windows this can be achieved by creating a Windows scheduled task which calls a .BAT file containing something like this:
CD cron
C:/PHP/PHP.EXE process_search_index_queue.php