167k views
3 votes
Description:

In this assignment, you will implement a deterministic finite automata (DFA) using C++ programming language to extract all matching patterns (substrings) from a given input DNA sequence string.

The alphabet for generating DNA sequences is {A, T, G, C}. Design a deterministic finite automaton to recognize the regular expression A(A+T+G+C)*T
Write a program which asks the user to input a DNA sequence. The program should be able to extract all the patterns (substrings present in the DNA sequence) that match the regular expression given in (1). You MUST implement DFA from (1) to check if all possible substrings in the DNA sequence is a part of the regular expression or not.
Note: even if the same pattern is found multiple times, it should be printed every time it is found. The program should mimic how a DFA processes a string: reading one letter at a time and doing state transitions. Please see Week 7 materials in BlackBoard for reference.

Below are two sample input/output. Only the bolded are user input.

Input a DNA sequence: CATTTGCAGGTG

Matching patterns are:

AT

ATT

ATTT

ATTTGCAGGT

AGGT

Input a DNA sequence: TTTATAAAA

Matching patterns are:

AT

Here is my code I have so far however the only output I can get is AT from either DNA Sequence..... what am I missing or doing wrong??:

#include
#include
#include
using namespace std;

int main() {
string dna;
cout << "Please input a DNA sequence: ";
cin >> dna;


regex re("A(A+T+G+C)*T");

smatch s;
regex_search(dna, s, re);
cout << "Matching patterns are: " << endl;

if (regex_search(dna, s, re)) {
for (int i = 0; i < s.size(); i++) {
ssub_match sub_match = s[i];
string piece = sub_match.str();
cout << piece << '\\';
}


}

return 0;

}

User Cmii
by
9.5k points

1 Answer

5 votes

Final answer:

The student's code using regex_search fails to capture all matching substrings because it only finds the first match. A loop is required to find and print all matching patterns as per the DFA designed to match the regular expression A(A+T+G+C)*T.

Step-by-step explanation:

The issue at hand is that the regex_search function captures only the first match of the regular expression within the input DNA sequence. In order to find all the occurrences of the substring that match your regular expression, you need to use regex_search in a loop, recursively finding subsequent matches starting from the end of the previous match.

To implement this using a deterministic finite automata (DFA) approach in C++, you should keep track of the state you're in (starting, middle, or accepting) and iterate over each character of the DNA sequence, looking for a match by transitioning through the states according to your DFA design. For each 'A' you encounter, start a new state-tracking iteration that checks if it is followed by a sequence of DNA letters and ends in a 'T', constituting a match. Print the match and continue scanning. You will need a nested loop—one scanning the entire input string and an inner loop checking each possible substring starting with 'A' and ending with 'T'.

User Manish Goyal
by
8.5k points