I am trying to build an inverted index that cycles through a folder of text files, normalizes each token in those files, and stores it in a TreeMap. The result should be a list of tokens, each followed by the documents the word appears in (not how many times it appears). Example:

and 1 2 4 6
big 2 5 7
house 4 8
etc...

The issue I am having is with looping through the documents and ensuring that the document numbers are not duplicated. I just received some help about using a Set within a Map so that all values are distinct.

What I am stuck on now is how to store those values (ideally directly) into the Set nested within the Map. As I was having issues with this, I tried using a secondary Set and then pushing it into the Map's nested Set, but that just results in every token being saved with all document numbers. The problem can be seen at lines 47 and 48 of my file (the docSet.add(...) and invertedIndex.put(...) calls), and yes, I am aware that those two lines are problematic, hence asking for help. Those were just two things I tried before deciding to post.
package assignment1;

import java.util.*;
import com.sun.jdi.Value;
import java.io.*;

public class Assignment1
{
    public static void main(String args[])
    {
        // Create a HashMap object called invertedIndex
        Map<String, Set<String>> invertedIndex = new TreeMap<>();
        Set<String> docSet = new TreeSet<>();
        int docNum = 0;
        try //Tries to execute the block of code nested and then follows with a catch clause for exceptions
        {
            System.out.print("Enter directory name: "); //Input for file directory
            Scanner dirScanner = new Scanner(System.in); //
            File dir = new File(dirScanner.nextLine()); //Reads in the directory
            File[] files = dir.listFiles(); //Creates an array of files contained within the directory
            for (File f : files) //For clause cycles through outputting the file names, and with nested while clauses, outputs the text contained within the files
            {
                docNum++;
                System.out.println("---DOC NUM: " + Integer.toString(docNum));
                Scanner fileScanner = new Scanner(f);
                while (fileScanner.hasNextLine())
                {
                    StringTokenizer tokenizer = new StringTokenizer(fileScanner.nextLine());
                    while (tokenizer.hasMoreTokens())
                    {
                        String tempToken = tokenizer.nextToken();
                        tempToken = tempToken.replaceAll("[^a-zA-Z0-9]", " ");
                        tempToken = tempToken.toLowerCase();
                        System.out.println("token: " + tempToken);
                        docSet.add(Integer.toString(docNum));
                        invertedIndex.put(tempToken, docSet);
                    }
                }
                fileScanner.close();
            }
            dirScanner.close();
            System.out.println();
            // Print keys and values
            for (String i : invertedIndex.keySet())
            {
                System.out.println(i + " " + invertedIndex.get(i));
            }
        } catch (Exception e)
        {
            System.out.println(e.toString());
        }
    }
}
Current output with the above code:
Enter directory name: C:\Users\SomeUser\Downloads\keeper
---DOC NUM: 1
token: the
token: old
token: night
token: keeper
token: keeps
token: the
token: keep
token: in
token: the
token: town
---DOC NUM: 2
token: in
token: the
token: big
token: old
token: house
token: in
token: the
token: big
token: old
token: gown
---DOC NUM: 3
token: the
token: house
token: in
token: the
token: town
token: had
token: the
token: big
token: old
token: keep
---DOC NUM: 4
token: where
token: the
token: old
token: night
token: keeper
token: never
token: did
token: sleep
---DOC NUM: 5
token: the
token: night
token: keeper
token: keeps
token: the
token: keep
token: in
token: the
token: night
---DOC NUM: 6
token: and
token: keeps
token: in
token: the
token: dark
token: and
token: sleeps
token: in
token: the
token: light
and [1, 2, 3, 4, 5, 6]
big [1, 2, 3, 4, 5, 6]
dark [1, 2, 3, 4, 5, 6]
did [1, 2, 3, 4, 5, 6]
gown [1, 2, 3, 4, 5, 6]
had [1, 2, 3, 4, 5, 6]
house [1, 2, 3, 4, 5, 6]
in [1, 2, 3, 4, 5, 6]
keep [1, 2, 3, 4, 5, 6]
keeper [1, 2, 3, 4, 5, 6]
keeps [1, 2, 3, 4, 5, 6]
light [1, 2, 3, 4, 5, 6]
never [1, 2, 3, 4, 5, 6]
night [1, 2, 3, 4, 5, 6]
old [1, 2, 3, 4, 5, 6]
sleep [1, 2, 3, 4, 5, 6]
sleeps [1, 2, 3, 4, 5, 6]
the [1, 2, 3, 4, 5, 6]
town [1, 2, 3, 4, 5, 6]
where [1, 2, 3, 4, 5, 6]
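
Every token in the output above ends up with the same list. That is consistent with the map holding one and the same Set object under every key, since put stores a reference to the set rather than a copy. The small sketch below isolates that behavior. The class name SharedSetDemo and the method buildShared are made up for illustration and are not part of the assignment code.

```java
import java.util.*;

public class SharedSetDemo {
    // Hypothetical helper: mirrors the pattern of putting one set instance under every key.
    static Map<String, Set<String>> buildShared() {
        Map<String, Set<String>> index = new TreeMap<>();
        Set<String> shared = new TreeSet<>();

        shared.add("1");
        index.put("the", shared);   // stores a reference to 'shared', not a copy

        shared.add("2");            // also changes what "the" sees
        index.put("big", shared);   // second key, same set object

        return index;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = buildShared();
        System.out.println(index);                                // {big=[1, 2], the=[1, 2]}
        System.out.println(index.get("the") == index.get("big")); // true
    }
}
```

Even though "the" was only put while the set contained 1, it reports [1, 2] afterwards, because both keys point at the same object.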
Attempted to use nested if-else statements to check 1.) whether the token was not yet stored (in which case it would store the token in the map along with the current docNum), 2.) whether the current token was already stored AND its value would equal the currently saved docNum plus the current docNum (as a String, not an integer), and 3.) otherwise add the current docNum to the saved docNum.
Changed the HashMap I was using to a TreeMap for ordering, and changed Map<String, String> to Map<String, Set<String>> (comments in the code still say HashMap instead of TreeMap).
Couldn't figure out how to store docNum into the nested Set within the TreeMap, so I created docSet to hold the docNum and then put it into the map (obviously didn't work).
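
One possible way to write straight into the nested Set, shown here only as a sketch, is Map.computeIfAbsent, which returns the set already stored for a key or creates and stores a new one first. To keep the example self-contained it makes a few assumptions that differ from the code in the question: documents are passed in as strings instead of being read from a directory, the class name InvertedIndexSketch and the method build are hypothetical, non-alphanumeric characters are removed rather than replaced with a space, and document numbers are kept as Integer so that a TreeSet orders 10 after 2.

```java
import java.util.*;

public class InvertedIndexSketch {
    // Hypothetical helper: maps each normalized token to the sorted set of
    // document numbers (1-based position in 'docs') in which it occurs.
    static Map<String, Set<Integer>> build(List<String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        int docNum = 0;
        for (String doc : docs) {
            docNum++;
            for (String raw : doc.split("\\s+")) {
                // Assumption: strip punctuation entirely, then lowercase.
                String token = raw.replaceAll("[^a-zA-Z0-9]", "").toLowerCase();
                if (token.isEmpty()) {
                    continue; // skip tokens that were only punctuation or blank
                }
                // Each token gets its own set; adding the same docNum twice is a no-op.
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(docNum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("The old night keeper", "In the big old house");
        System.out.println(build(docs));
        // {big=[2], house=[2], in=[2], keeper=[1], night=[1], old=[1, 2], the=[1, 2]}
    }
}
```

Because a new TreeSet is created per token, there is no shared docSet, and the Set itself takes care of keeping document numbers distinct.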
After fiddling around a bit I was able to come up with the below code:
Which produces the below output: