Lets suppose I have a xml documents in which i can find links to the other documents of the same type which can also have a links to another one documents. At starting point I have list of documents to read and analise. I have written following algorithm to read and analise those documents:
private static List<String> documentNames = new ArrayList<String>();
main(...) {
//add names to documentNames arrayList above.
for(String documentName : documentNames) {
readDocument(documentName);
}
}
Function readDocument looks following:
private static CopyOnWriteArrayList<String> visitURL(String documentName) {
CopyOnWriteArrayList<String> visitedDocs = new CopyOnWriteArrayList<String>(); //visited Ref urls
if (!visitedDocs .contains(documentName)) {
analyseAndWriteOnDisk(documentName) //it saves analised document on disk
CopyOnWriteArrayList<String> tmp = visitURL(documentName);
visitedDocs.addAll(tmp);
} else {
System.out.println(documentName " - I have seen it !");
}
return visitedDocs;
}
It works, but after execution of the programm I can find duplicate files (files with the same content). I shouldnt have them - I prevent it by if-condition in function visitURL. My question is: what doesn't work here ? I suppose that something is wrong with with manipulation with array visitedDocs. How can I get on every recursion call actuall version of array with already visited files ?
Being as most precise as I can, I have a recursion function which operates on some collection X:
recursion(CollectionType X) {
someoperations(X)
recursion(X)
}
and X
must be always actual.
You should not use a recursive algorithm for that. It's easier to use a queue that holds all documents to analyze, and a set that holds all documents that are already analyzed. As long as the queue is not empty, you pull a document from it, analyze it and add extracted links to the queue if they are not already visited.
Usage:
Advantage of this iterative approach over recursive solution:
Stack
instead, you get depth-first, and some priority-queue might let you decide based on document type or so).Note: If you don't use multiple threads, you should not use
CopyOnWriteArrayList
as it makes a full copy of its internal content on every write access!