How to expose a large collection of XML documents (~2M) for offline querying (XPath)?


I have just under 2 million XML documents occupying 16 GB of file system space. They are all valid and share a single DTD. They are all roughly the same size (all generated by the same lab information system).

I'm looking for an easy way for a single user to query the whole 2M-document corpus. I'm not looking to expose this to the web or even to multiple LAN users; however, I would like to be able to expose some query interface to my intranet. I'm flexible on the query language, but I would like to be able to run ad hoc queries. I want it to be at least semi-performant, and I'm willing to dedicate additional disk space as needed to accommodate indexes.

A workable solution has to be deployable on a single quad-core Linux box with 8 GB of RAM; new hardware isn't an option.

I found eXist-db, but it doesn't seem to have much activity, and the demo site is down.

There are 4 best solutions below

2
On BEST ANSWER

I would try in this order:

  1. BaseX (Has nice GUI. Most promising open source XML db I've found. BSD license)
  2. Sedna (My favorite before BaseX. Apache 2.0 license)
  3. Berkeley DB-XML (Is an embedded flat-file DB. Sleepycat license)
  4. eXist (eXist has always been a hacky disaster. GNU LGPL license)

My hunch is that Berkeley would be the fastest, but BaseX and Sedna are both network-accessible, and BaseX would be the easiest to start using and querying. Sedna also has a schema-aware storage system, which might be beneficial for the situation you describe. Berkeley's Sleepycat license may be an encumbrance if you have a commercial use, so look at it carefully.
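To give a feel for what "network-accessible" could mean for ad hoc queries, here is a rough sketch of calling BaseX over HTTP from Python. It assumes the BaseX HTTP server is running with its REST interface on the default port 8984 and HTTP Basic authentication. The database name `labdocs`, the element name `report`, and the credentials are all hypothetical placeholders, so check the BaseX REST documentation for your version before relying on any of this.

```python
import base64
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def basex_rest_url(database, xquery, host="localhost", port=8984):
    """Build a GET URL asking the BaseX REST interface to evaluate a query.

    Assumes the REST service is mounted at /rest (an assumption about the setup).
    """
    return f"http://{host}:{port}/rest/{quote(database)}?" + urlencode({"query": xquery})


def run_query(database, xquery, user, password, **kwargs):
    """Send the query with HTTP Basic auth and return the response body as text."""
    request = Request(basex_rest_url(database, xquery, **kwargs))
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urlopen(request) as response:
        return response.read().decode("utf-8")


# Hypothetical usage against a database named "labdocs":
# print(run_query("labdocs", "count(//report)", "someuser", "somepassword"))
```

Because the query travels as a plain URL parameter, the same endpoint can also be opened from a browser on the intranet, which may be enough of a "query interface" for a single user.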

1
On

My preference is to create an inverted index using a full-text search engine. Below are my preferred options, in order; I suggest you spend time researching these three.

  1. Solr (Web interface for querying, easy to get started)
  2. ElasticSearch (Distributed, easy to get started)
  3. Raw Lucene (1 & 2 use Lucene behind the scenes)
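For readers unfamiliar with the term, an inverted index maps each token to the documents that contain it, so a search becomes a set intersection rather than a scan of every file. The toy sketch below illustrates only that idea in plain Python; it is not how Lucene, Solr, or ElasticSearch are implemented, and the sample documents and element names are made up.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for chunk in root.itertext():  # text content only, element names are ignored
            for token in tokenize(chunk):
                index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Made-up sample documents standing in for the real corpus.
docs = {
    "a.xml": "<report><test>glucose</test><result>high</result></report>",
    "b.xml": "<report><test>glucose</test><result>normal</result></report>",
}
index = build_index(docs)
print(search(index, "glucose high"))
```

Note the trade-off this makes visible: the index knows which documents contain a word but has discarded the element structure, so structural queries of the kind XPath expresses need extra work, such as indexing each element as its own field.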

Why full-text-search engines?

  1. Faster
  2. Highlighting
  3. Faceting
  4. Allows free-form search (with XML databases you will be working against XPath or XQuery or something similar)
  5. Proven to search quickly even with a huge set of files
  6. File-based

2
On

You definitely want an XML database. I would say the emerging leaders are MarkLogic for a commercial product and eXist for open source. Others might have other views. Getting to grips with a new database product always involves a steep learning curve (and the more capable the database, the more there is to learn). But eXist can certainly hack it; don't give up at the first hurdle.

0
On

I agree with Michael Kay. Use eXist-db if you want open source and MarkLogic if you want commercial. I did a project for the US Library of Congress NDIIPP program, and after an extensive ATAM analysis we selected eXist as superior to the other systems due to its active user community and widespread use. If you have doubts, just do a search on MarkMail. I think you will find that eXist has a more active discussion than any other system.

About 350 pages of the report are online here:

http://www.mnhs.org/preserve/records/legislativerecords/pilot.htm