Data retrieval / search in text

411 Views Asked by At

I am working on a selfProjet for my own interest on data retrieval. I have one text file with the following format.

.I 1
.T
experimental investigation of the aerodynamics of a
wing in a slipstream . 1989
.A
brenckman,m.
.B
experimental investigation of the aerodynamics of a
wing in a slipstream .
.I 2
.T
simple shear flow past a flat plate in an incompressible fluid of small
viscosity .
.A
ting-yili
.B
some texts...
some more text....
.I 3
...

".I 1" indicate the beginning of chunk of text corresponding to doc ID1 and ".I 2" indicates the beginning of chunk of text corresponding to doc ID2.

I did:

  • split the docs and put them in separate files
  • delete stopwords (and, or, while, is, are, ...)
  • stem the words to get the root of each (achievement, achieve, achievable, ...all converted to achiv and so on)
  • and finally create e TreeMultiMap which looks like this:

{key: word}  {Values are arraylist of docID and frequency of that word in that docID} 

aerodynam  [[Doc_00001,5],[Doc_01344,4],[Doc_00123,3]]
book       [[Doc_00562,6],[Doc_01111,1]]
....
....
result     [[Doc_00010,5]]
....
....

zzzz       [[Doc_01235,1]]

Now my questions: Suppose that user is interested to know:

  1. what documents does have achieving and book? (idea)
  2. documents which has achieving and skills but not book nor video
  3. document include Aerodynamic
  4. and some other simple queries like this

(input) so suppose she enters

  • achieving AND book
  • (achieving AND skills) AND (NOT (book AND video))
  • Aerodynamic
  • .....and some other simple queries

(Output)

  • [Doc_00562,6],[Doc_01121,5],[Doc_01151,3],[Doc_00012,2],[Doc_00001,1]
  • ....

as you can see there might be

  • Some precedence modifier (parenthesis which we dont know the depth)
  • precedence of AND, OR, NOT
  • and some other interesting challenges and issues

So, I would like to run the queries against the TreeMultimap and search in the words(key) and retrieve the Values(list of docs) to user.

how should I think about this problem and how to design my solution? what articles or algorithms should i read? any idea would be appreciated. (thanks for reading this long post)

1

There are 1 best solutions below

1
On

The collection that you have used is the Cranfield test collection, which I believe has around 3000 documents. While for collections of this size, it is okay to store the inverted list (the data structure that you have constructed) in memory with a hash-based or trie based organization, for realistic collections of much larger sizes, often comprised of millions of documents, you would find it difficult to store the inverted list entirely within memory in such cases.

Instead of reinventing the wheel, the practical solution is thus to make use of a standard text indexing (and retrieval) framework such as Lucene. This tutorial should help you to get started.

The questions that you seek to address can be answered by Boolean queries where you can specify set of Boolean operators AND, OR and NOT between its constituent terms. Lucene supports this. Have a look at the API doc here and a related StackOverflow question here.

The Boolean query retrieval algorithm is very simple. The list elements (i.e. the document ids) corresponding to each term are stored in sorted order so that at run-time it is possible to compute the union and intersection in time linear to the size of the lists, i.e. O(n1+n2).... (this is very similar to mergesort). You can find more information in this book chapter.