OneNote support for Apache Tika parsers

I want to detect MIME types for .one, .onetoc, and .onetoc2 files using Apache Tika. However, its documentation (https://tika.apache.org/1.14/formats.html) does not seem to list support for them. When I rely purely on content-based detection, Tika always returns application/octet-stream instead of application/onenote.

Tika does support extension- and name-based detection of the MIME type, but that is unreliable: I can rename any file to *.one and it will be reported as 'application/onenote', which is incorrect.

Is there a library that can reliably detect whether a given file is a OneNote file, or is there something I am missing in Tika?

There are 1 best solutions below

For mime-magic driven OneNote file detection, you need Apache Tika 1.15 or later.
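
To illustrate what magic-based detection means here (as opposed to trusting the file name), below is a minimal standalone sketch in plain Java that checks whether a file starts with the OneNote section header. This is not Tika's implementation. The class and method names are hypothetical, and the 16-byte value is an assumption: it is the .one file-type GUID {7B5C52E4-D88C-4DA7-AEB1-5378D02996D3} as I understand it from Microsoft's OneNote file format documentation, written in on-disk byte order. It only covers .one section files, not .onetoc2.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of Apache Tika.
class OneNoteMagicSniffer {
    // ASSUMPTION: header GUID of a .one section file,
    // {7B5C52E4-D88C-4DA7-AEB1-5378D02996D3}, in on-disk byte order.
    private static final byte[] ONE_MAGIC = {
        (byte) 0xE4, 0x52, 0x5C, 0x7B, (byte) 0x8C, (byte) 0xD8, (byte) 0xA7, 0x4D,
        (byte) 0xAE, (byte) 0xB1, 0x53, 0x78, (byte) 0xD0, 0x29, (byte) 0x96, (byte) 0xD3
    };

    // True if the buffer begins with the assumed .one magic bytes.
    static boolean startsWithOneMagic(byte[] data) {
        if (data == null || data.length < ONE_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, ONE_MAGIC.length), ONE_MAGIC);
    }

    // Reads only the first 16 bytes of the file; the file name is ignored.
    static boolean looksLikeOneNoteSection(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return startsWithOneMagic(in.readNBytes(ONE_MAGIC.length));
        }
    }
}
```

With a check like this, a text file renamed to something.one is rejected because its first bytes do not match, which is the behaviour the question is after. In practice, prefer Tika's own detector on 1.15 or later rather than hand-rolled sniffing.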

For OneNote parsing (metadata, text, etc.), you either need to wait for Apache Tika 1.24 to be released (due March-ish 2020), or build it yourself from source, including the patches from GitHub pull request #303 / TIKA-2224.

And if you're a Tika + OneNote user, give a big thanks to Nicholas DiPiazza (who did most of the work) and Tim Allison (who helped review/steer/etc.).