Applying fastai to DNA sequence data


I am trying to use fastai for RNA prediction from DNA sequences.

The current bottleneck is getting my DNA sequence data into a DataBlock. Each sequence is 500 characters long (A, C, G, T) and has one score. I use the following function to turn a DNA sequence into something image-like (a one-hot matrix that can become a tensor):

import numpy as np

def one_hot_encode(seq):
    """
    Given a DNA sequence, return its one-hot encoding
    as an array of shape (len(seq), 4).
    """
    # Normalize case so lowercase bases are accepted
    seq = seq.upper()
    # Make sure seq has only allowed bases
    allowed = set("ACGTN")
    if not set(seq).issubset(allowed):
        invalid = set(seq) - allowed
        raise ValueError(f"Sequence contains chars not in allowed DNA alphabet (ACGTN): {invalid}")
    # Dictionary returning one-hot encoding for each nucleotide;
    # the ambiguous base N is mapped to all zeros so the lookup cannot fail
    nuc_d = {'A':[1.0,0.0,0.0,0.0],
             'C':[0.0,1.0,0.0,0.0],
             'G':[0.0,0.0,1.0,0.0],
             'T':[0.0,0.0,0.0,1.0],
             'N':[0.0,0.0,0.0,0.0],
            }
    # Create array from nucleotide sequence (float32 matches PyTorch's default dtype)
    vec = np.array([nuc_d[x] for x in seq], dtype=np.float32)
    return vec

However, I am having trouble setting up the DataBlock and DataLoaders for it. I would be grateful for any tips.
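
For reference, here is a rough sketch of the kind of setup I have in mind. It is untested against my real data and rests on assumptions: the sequences and scores sit in a pandas DataFrame with columns named `seq` and `score` (hypothetical names), the score is a single float so `RegressionBlock` fits, and the model wants channels first, i.e. shape (4, length). `encode_seq`, `encode_tensor` and `build_dls` are names I made up for this sketch; `DataBlock`, `TransformBlock`, `RegressionBlock`, `ColReader` and `RandomSplitter` are the fastai pieces I believe apply here.

```python
import numpy as np

BASES = "ACGT"
BASE_INDEX = {b: i for i, b in enumerate(BASES)}

def encode_seq(seq):
    """One-hot encode a DNA string as a float32 array of shape (4, len(seq))."""
    arr = np.zeros((len(BASES), len(seq)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        row = BASE_INDEX.get(base)
        if row is not None:        # anything else (e.g. 'N') stays an all-zero column
            arr[row, pos] = 1.0
    return arr

def encode_tensor(seq):
    """Same encoding, returned as a torch tensor so batches collate cleanly."""
    import torch                   # imported lazily; only needed on the fastai path
    return torch.from_numpy(encode_seq(seq))

def build_dls(df, bs=64):
    """Sketch: build DataLoaders from a DataFrame with 'seq' and 'score' columns (assumed names)."""
    from fastai.data.block import DataBlock, TransformBlock, RegressionBlock
    from fastai.data.transforms import ColReader, RandomSplitter

    dblock = DataBlock(
        # x: raw string -> one-hot tensor; y: float regression target
        blocks=(TransformBlock(type_tfms=encode_tensor), RegressionBlock()),
        get_x=ColReader("seq"),
        get_y=ColReader("score"),
        splitter=RandomSplitter(valid_pct=0.2, seed=42),
    )
    return dblock.dataloaders(df, bs=bs)
```

With a DataFrame `df` in that layout, `dls = build_dls(df)` followed by `xb, yb = dls.one_batch()` should, if the assumptions hold, give `xb` of shape (bs, 4, 500), which is what a 1D convolution over the sequence would expect.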
