Getting all positions of an occurring String using StringBuilder.indexOf()

Java beginner here. I'm currently working on a program that searches part of the human DNA. Specifically, I want to find all occurrences of a String within a StringBuilder, using StringBuilder.indexOf(). But I need all occurrences, not just the first.

Code:

public void search(String motive){
    int count = 0;
    gene.indexOf(motive);   // gene is the StringBuilder
    count++;


}

I need all occurrences of motive in the gene StringBuilder, plus a count of how often motive appears in gene. Any help, since indexOf() only returns the first occurrence?

Best answer:

I take it that you are looking for the indices of a specific nucleotide sequence within a gene sequence or sub-sequence. The following example class demonstrates a generic approach that uses Java's regular expression library to find them:

package jcc.tj.dnamatch;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Gene {
   private String gene;

   public Gene() {}

   public Gene( String gene ) {
      this.gene = gene;
   }

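   // Returns the start index of each non-overlapping match of seq in gene.
   // Note: seq is compiled as a regular expression, not as literal text.
   public List<Integer> find( String seq ) {
      List<Integer> indices = new ArrayList<Integer>();

      Pattern pat = Pattern.compile( seq );
      Matcher m = pat.matcher( gene );

      // Each call to find() continues from where the previous match ended.
      while ( m.find() )
         indices.add( m.start() );

      return indices;
   }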

   public String getGene() {
      return gene;
   }

   public void setGene( String gene ) {
      this.gene = gene;
   }
}

The above example uses a Matcher to find patterns. There are other String-based algorithms that may be more efficient, but as a starting point, the Matcher offers a generic solution to any type of text pattern search.
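To connect the Matcher approach back to the original question (positions plus a count), here is a minimal standalone sketch. The class name MotifSearch, the method name, and the sample sequence are made up for illustration. Wrapping the motif in Pattern.quote() is my own assumption that the motif should be treated as literal text; drop it if you want regex motifs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MotifSearch {
    // Hypothetical helper: start index of every non-overlapping literal match.
    // Accepts any CharSequence, so a StringBuilder can be searched without copying.
    static List<Integer> find(CharSequence gene, String motif) {
        List<Integer> indices = new ArrayList<>();
        if (motif.isEmpty()) {
            return indices; // nothing sensible to report for an empty motif
        }
        // Pattern.quote() makes the motif match literally (assumption, see above).
        Matcher m = Pattern.compile(Pattern.quote(motif)).matcher(gene);
        while (m.find()) {
            indices.add(m.start());
        }
        return indices;
    }

    public static void main(String[] args) {
        StringBuilder gene = new StringBuilder("ATGCGATGTTATG"); // sample data
        List<Integer> hits = find(gene, "ATG");
        System.out.println("positions: " + hits);      // positions: [0, 5, 10]
        System.out.println("count: " + hits.size());   // count: 3
    }
}
```

The count the question asks for is simply the size of the returned list, so no separate counter is needed.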

Encoding nucleotides as characters (ATCG) is very flexible and convenient, allowing the use of String-based tools to analyze and characterize sequences and/or sub-sequences. Unfortunately, String-based representations do not scale well. For large sequences, it would be better to consider more specialized bioinformatics techniques for representing and managing them.

A good reference on such techniques would be Chapter 2, Algorithms and Data Structures in Next-Generation Sequencing, of the book Next Generation Sequencing Technologies and Challenges in Sequence Assembly. A more detailed PDF preview of it is available from this Google link, though I can't guarantee it will work forever.

You may also want to look at BioJava. While I wouldn't want to steer you away from Java, Perl is another good alternative for sequence analysis; see Beginning Perl for Bioinformatics, Perl and Bioinformatics, or BioPerl.

I realize that this answer may be TMI; but, if it helps you or others find more appropriate solutions, it served its purpose.

Edit:

Based on the comment below, this appears to be a homework question, given the requirement that the search be accomplished by StringBuilder.indexOf(). The following method would accomplish the search accordingly.

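public List<Integer> findBySb( String seq ) {
    List<Integer> indices = new ArrayList<Integer>();
    // An empty seq would match at strIdx forever (the loop never advances), so return early.
    if ( seq.isEmpty() )
        return indices;

    StringBuilder sb = new StringBuilder( gene );
    int strIdx = 0;

    while ( strIdx < sb.length() ) {
        // Search from strIdx onward; -1 means there is no further occurrence.
        int idx = sb.indexOf( seq, strIdx );
        if ( idx == -1 )
            break;
        indices.add( idx );
        // Skip past this match; use idx + 1 instead to also find overlapping matches.
        strIdx = idx + seq.length();
    }

    return indices;
}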

The same indexOf() approach can be used with the string directly.

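public List<Integer> findByString( String seq ) {
    List<Integer> indices = new ArrayList<Integer>();
    // An empty seq would match at strIdx forever (the loop never advances), so return early.
    if ( seq.isEmpty() )
        return indices;

    int strIdx = 0;

    while ( strIdx < gene.length() ) {
        // Search from strIdx onward; -1 means there is no further occurrence.
        int idx = gene.indexOf( seq, strIdx );
        if ( idx == -1 )
            break;
        indices.add( idx );
        // Skip past this match; use idx + 1 instead to also find overlapping matches.
        strIdx = idx + seq.length();
    }

    return indices;
}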
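Both methods above advance by seq.length() after each hit, so they report non-overlapping occurrences only. A motif that can overlap itself (for example AAA inside AAAA) would be under-counted. If overlapping hits matter for your assignment, a variant that advances by a single character may be closer to "all occurrences". This is a minimal sketch; the class name, method name, and sample sequence are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class OverlapSearch {
    // Hypothetical variant: reports every start index, including overlapping ones.
    static List<Integer> findOverlapping(StringBuilder gene, String seq) {
        List<Integer> indices = new ArrayList<>();
        if (seq.isEmpty()) {
            return indices; // an empty motif is treated as "no hits" here
        }
        int idx = gene.indexOf(seq);
        while (idx != -1) {
            indices.add(idx);
            // Restart one character after the hit, not after the whole motif.
            idx = gene.indexOf(seq, idx + 1);
        }
        return indices;
    }

    public static void main(String[] args) {
        StringBuilder gene = new StringBuilder("AAAAGAAA"); // sample data
        List<Integer> hits = findOverlapping(gene, "AAA");
        System.out.println("positions: " + hits);     // positions: [0, 1, 5]
        System.out.println("count: " + hits.size());  // count: 3
    }
}
```

On the same input, the non-overlapping loops above would return [0, 5], so the choice of step size changes the count.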

Both StringBuilder and String delegate to the same static implementation of String.indexOf(), so functionally there is no difference. However, instantiating a StringBuilder just for searching is overkill and a little more wasteful, since it also allocates buffers to manage string operations. I could go on :), but it doesn't add to the answer.
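If you want to convince yourself that the two behave the same, a quick spot check like the one below compares the results at every start offset. It is a sketch on one made-up input, not a proof; the class and method names are hypothetical.

```java
class IndexOfComparison {
    // Returns true if String.indexOf and StringBuilder.indexOf agree
    // for every start offset from 0 up to and including text.length().
    static boolean sameResults(String text, String seq) {
        StringBuilder sb = new StringBuilder(text);
        for (int from = 0; from <= text.length(); from++) {
            if (text.indexOf(seq, from) != sb.indexOf(seq, from)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameResults("GATTACAGATTACA", "TTA")); // true
    }
}
```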