We have a solution that analyzes the content of various documents (Word, PDF, XLS, CSV, etc.) to detect certain patterns of interest. Most of the solution is written in Golang, but we currently use Apache Tika to extract text content from these documents (including compressed formats).
Our goal is to run this solution on endpoints where CPU and RAM are at a premium, so the solution must be lightweight and optimized.
Problem: Although the solution achieves our functional objectives, we are not fully satisfied with its performance, startup time, and memory density, and frankly we would like to remove the JVM from our solution altogether.
Ideally, we could run Tika as a native executable without the overhead of the JVM, and call the required Tika components from our Golang solution.
So far, we have been able to get a prototype working using Quarkus and GraalVM, where Tika is built as a native image ([login to view URL]).
That said, we have a small team and would like help defining the best way to achieve our AOT objectives with Tika.
We are looking for an expert who has experience with Tika and AOT compilation to support us with this project.
We need help defining, refining, and implementing the optimal architecture for this project. Ultimately, we would like to deploy our solution to Windows, Linux, and macOS targets using the native Tika approach.
-= Please reply with a description of your relevant expertise with Tika and its native usage, along with any comments relevant to this project =-