Recursive function in Java - thread safe collection

545 Views Asked by At

Lets suppose I have a xml documents in which i can find links to the other documents of the same type which can also have a links to another one documents. At starting point I have list of documents to read and analise. I have written following algorithm to read and analise those documents:

    private static List<String> documentNames = new ArrayList<String>();

    main(...) {
       //add names to documentNames arrayList above.
       for(String documentName : documentNames) {
           readDocument(documentName);
        }
    }

Function readDocument looks following:

       private static CopyOnWriteArrayList<String> visitURL(String documentName) {
       CopyOnWriteArrayList<String> visitedDocs = new CopyOnWriteArrayList<String>(); //visited Ref urls
         if (!visitedDocs .contains(documentName)) {
            analyseAndWriteOnDisk(documentName)    //it saves analised document on disk
            CopyOnWriteArrayList<String> tmp = visitURL(documentName);
            visitedDocs.addAll(tmp);
         } else {
            System.out.println(documentName " - I have seen it !");
         }
         return visitedDocs;
       }

It works, but after execution of the programm I can find duplicate files (files with the same content). I shouldnt have them - I prevent it by if-condition in function visitURL. My question is: what doesn't work here ? I suppose that something is wrong with with manipulation with array visitedDocs. How can I get on every recursion call actuall version of array with already visited files ?

Being as most precise as I can, I have a recursion function which operates on some collection X:

   recursion(CollectionType X) {
      someoperations(X)
      recursion(X)
   }

and X must be always actual.

2

There are 2 best solutions below

6
On

You should not use a recursive algorithm for that. It's easier to use a queue that holds all documents to analyze, and a set that holds all documents that are already analyzed. As long as the queue is not empty, you pull a document from it, analyze it and add extracted links to the queue if they are not already visited.

private Collection<String> visit(Collection<String> intialDocs) {
    Queue<String> documents = new LinkedBlockingQueue(initialDocs);
    Set<String> visited = new HashSet<>();
    while (!documents.isEmpty()) {
        String doc = documents.poll();
        visited.add(doc);

        Collection<String> links = analyzeDocument(doc);
        for(String link : links) {
            if (!visited.contains(link) documents.add(link);
        }
    }
    return visited;
}

private Collection<String> analyzeDocument(String document) {
    // TODO: analyze document and return a list of all links in that document
}

Usage:

Set<String> allVisitedDocuments = visit(documentNames);

Advantage of this iterative approach over recursive solution:

  • It is easier to see how it works.
  • It is easier to argue that it will terminate.
  • It is easier to debug.
  • It can easily be parallelized if required.
  • Order of document processing can easily be influenced by just changing the type of collection used for queueing the documents. (Now it performs a breadth-first search, if you use a LIFO like Stack instead, you get depth-first, and some priority-queue might let you decide based on document type or so).
  • If you have long series of linked documents, recursion may become very deep and a stack overflow may happen.

Note: If you don't use multiple threads, you should not use CopyOnWriteArrayList as it makes a full copy of its internal content on every write access!

2
On

Each time you call visitURL, you're creating a new instance of visitedDocs. So, it's empty every time at the beginning of the call, and at the end contains only the current iteration of tmp.

According to JavaDocs, you need to call the new one like this:

CopyOnWriteArrayList<String> visitedDocs = new CopyOnWriteArrayList<String>(documentNames) //here you need to add the parameter of the ArrayList you want to copy, otherwise you're instantiating a blank ArrayList.

Then, you'll need to set your documentNames equal to the returned visitedDocs.