I have just short of 2 million XML documents occupying 16 GB of file system space. They are all valid and share a single DTD. They are all of roughly equal size (all generated by the same lab information system).
I'm looking for an easy way for a single user to query the whole 2M-document corpus. I'm not looking to expose this to the web or even to multiple LAN users; however, I would like it to be able to expose some query interface to my intranet. I'm flexible on the query language, but I would like to be able to do ad hoc queries. I want it to be at least semi-performant, and I'm willing to dedicate additional disk space as needed to accommodate indexes.
A workable solution has to be deployable on a single quad-core Linux box with 8 GB of RAM; new hardware isn't an option.
I found eXist-db, but it doesn't seem to have much recent activity, and the demo site is down.
I would try in this order:
My hunch is that Berkeley DB XML would be the fastest, but BaseX and Sedna are both network-accessible, and BaseX would be the easiest to start using and querying. Sedna also has a schema-aware storage system, which might be beneficial for the situation you describe. Berkeley's Sleepycat license may be an encumbrance if you have a commercial use, so look at it carefully.
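To give a feel for what "network-accessible" could mean for ad hoc queries from the intranet, here is a minimal sketch that only builds the URL for a query sent over HTTP. It assumes BaseX's HTTP server is running with its REST interface at `http://localhost:8984/rest`; check the host, port, and path against your own install. The database name `labdocs` and the `result`/`status` names in the query are hypothetical, since the real DTD isn't shown.

```python
from urllib.parse import quote, urlencode


def build_rest_query_url(base_url: str, database: str, xquery: str) -> str:
    """Return a GET URL that asks a REST endpoint to evaluate `xquery`."""
    # Percent-encode the database name and the query text so that
    # characters such as '/', '[' and quotes survive inside the URL.
    path = quote(database, safe="")
    params = urlencode({"query": xquery})
    return f"{base_url.rstrip('/')}/{path}?{params}"


# Hypothetical database and element/attribute names, for illustration only.
url = build_rest_query_url(
    "http://localhost:8984/rest",  # assumed endpoint; verify for your setup
    "labdocs",
    "count(//result[@status='final'])",
)
print(url)
```

The resulting URL can be fetched with a browser, `curl`, or any HTTP client, supplying whatever credentials the server is configured with, which is about as little tooling as a single-user query interface can require.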