Tag Archives: no-growth algorithm

How to retrieve a set of sequences from within a FASTA file with Python

A common need in bioinformatics is to extract a subset of sequences from within a FASTA file. You might only want sequences from a particular taxon, sequences that were matched in a BLAST search, sequences that you chose by throwing a dart on a map of South America — the reasons are endless. Imagining a file with five nucleotide sequences labeled Seq1-Seq5, and that you only want odd numbered sequences, like so:

Original fasta file: List of genes you want: Desired output:
>Seq1
ACTGCGTATCGACTAGCTA …
>Seq2
TACTATATCAGTGTCGCGT …
>Seq3
ATGATATGTCGGCCGGTTG …
>Seq4
GTATTGATGCATCAGTCGT …
>Seq5
AATGCGTAAGTAGTCGCGT …
Seq1
Seq3
Seq5
>Seq1
ACTGCGTATCGACTAGCTA …
>Seq3
ATGATATGTCGGCCGGTTG …
>Seq5
AATGCGTAAGTAGTCGCGT …

Once more, Python to the rescue! Create a separate text file with the identifier names of interest (like the second column above), and their extraction can be achieved quickly and easily with the following script:

#! /usr/bin/env python
import sys
import os
# A script for extracting certain sequences from within a FASTA file.

# First, convert FASTA file into file with one line per sequence.
# Make sure the name of your FASTA file doesn't contain any dots 
# besides the one before the extension!
linenum = 0
onelinefile = 'allononeline_' + sys.argv[1].partition('.')[0] + '.tmp'
with open(onelinefile, 'w') as outfile:
    for line in open(sys.argv[1], 'r'):
        line = line.strip()
        if len(line) > 0:
            if line[0] == '>':
                if linenum == 0:
                    outfile.write(line + '\t')
                    linenum += 1
                else:
                    outfile.write('\n' + line + '\t')
            else:
                outfile.write(line)

# Populate a dictionary using the 'allononeline' file
all_seqs = {}
with open(onelinefile, 'r') as allseqsfile:
	for line in allseqsfile:
		line = line.strip().split('\t')
		all_seqs[line[0][1:]] = line[1]

# Generate a set of the sequences you wish to retrieve
desired_seqs = set()
with open(sys.argv[2], 'r') as listfile:
	for line in listfile:
		desired_seqs.add(line.strip())

# Find the overlap between the total sequences and the desired ones
all_seqs_names = set(all_seqs.keys())
toextract = all_seqs_names.intersection(desired_seqs)

# Use 'toextract' set to generate desired file
with open('justdesired_' + sys.argv[1], 'w') as extractfile:
	for name in toextract:
		extractfile.write('>' + name + '\n' + all_seqs[name] + '\n')

os.remove(onelinefile)

Run on command line:

python seqretriever.py your_fasta_file.fa desired_sequences.txt

Lines 9-22 create a temporary deinterleaved version of your FASTA file, except with identifiers and sequences on one line rather than two. This is done so they can easily be populated into a dictionary all_seqs on lines 25-29. The set of desired sequences desired_seqs is created on lines 32-35 by pulling from an external file of sequence names.

The keys (identifiers) within all_seqs are then searched for overlap with desired_seqs, and the overlapping names are entered into toextract on lines 38-40. These are used to pull out desired sequences (which are stored as values of the identifier keys) from all_seqs, which are exported into the final justdesired FASTA file on lines 42-44.

Note that we are using sets — unordered collections of unique elements. Because sets do not record order of insertion, the order of the output cannot be controlled, and will likely be different than the order of input. We do this because detecting overlap between sets and dictionaries is much faster than scanning iterable sequences/lists. The former is an O(1) algorithm, meaning its computational time is independent of the size of the dataset, whereas the latter is O(N), meaning its computational time is linearly proportional to the size of the dataset. As you can imagine, once your dataset becomes large enough (e.g., FASTA files with tens of thousands of sequences), you will always want to find a no-growth algorithmic solution! Sets and dictionaries are great solutions for this kind of rapid membership/overlap testing. Happy coding!