98.9k views
4 votes
Can someone help me with this Python lab? Please answer with Python code! PLEASE DO NOT COPY OTHER CODE OR I WILL HAVE TO GIVE A DOWNVOTE!!! This lab requires you to import the re module.

Function to complete: split_ngrams(text, n)

Write a function that takes as arguments a string text and an integer n, and extracts all of the N-grams of size n from the text.

An N-gram is a sequence of n words that occur next to each other in the text. For example, the text "cats and dogs" contains the 2-grams ("bigrams") "cats and" and "and dogs". For this task, we will treat an N-gram as a tuple consisting of n strings, such as ("cats", "and") and ("and", "dogs").
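As a rough illustration of the definition above, the sketch below slides a window of n words across a list that has already been split into words. The helper name window_ngrams is hypothetical and not part of the lab; it ignores punctuation and sentence edges entirely.

```python
def window_ngrams(words, n):
    """Return every run of n adjacent words as a tuple (hypothetical helper)."""
    # A list of m words has m - n + 1 windows; range() is empty when n > m.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


print(window_ngrams(["cats", "and", "dogs"], 2))
# [('cats', 'and'), ('and', 'dogs')]
```

If n is larger than the number of words, this sketch returns an empty list rather than raising an error.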

N-grams are usually extracted from complete sentences, which means that we need to mark the edges of the sentence somehow. To do this, we add the pretend words "START" and "END" to the start and end of the sentence, respectively. As a result, the first N-gram in the sentence will have "START" as its first item, and the last N-gram in the sentence will have "END" as its last item. For example, the bigrams in the sentence It's raining cats and dogs! would be ("START", "it's"), ("it's", "raining"), ("raining", "cats"), ("cats", "and"), ("and", "dogs"), and ("dogs", "END").
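To make the edge markers concrete, here is a small sketch of the padding step on its own. It assumes the words have already been lowercased and cleaned, and pad_sentence is a hypothetical name used only for illustration.

```python
def pad_sentence(words):
    """Wrap a word list in the START/END markers (hypothetical helper)."""
    return ["START"] + list(words) + ["END"]


padded = pad_sentence(["it's", "raining", "cats", "and", "dogs"])

# The first and last bigrams are the ones that touch the sentence edges.
print(tuple(padded[:2]))   # ('START', "it's")
print(tuple(padded[-2:]))  # ('dogs', 'END')

# A padded list of m items yields m - n + 1 N-grams of size n.
print(len(padded) - 2 + 1)  # 6
```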

For this task, each text you will be splitting is a single sentence. In this case, you should proceed as follows:

Import the re module in Python. Split the sentence into a list of words, lowercasing each word and removing word-external punctuation. This is the same word extraction needed by a function like count_words(), which takes a long text string as an argument and returns a dictionary that counts how many times each word in the string occurred, so it is recommended to write a helper function extract_words() for this purpose.

Add pretend words "START" and "END" to the start and end, respectively, of the list of words.

Form the N-grams as tuples of exactly n items in a row from the list. The first N-gram should have "START" as its first item, and the last one should have "END" as its last item.

Your function should return a list of tuples, where each tuple represents an N-gram.
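The word-splitting step described above could be sketched as follows. The lab names extract_words() and count_words() but does not give their bodies, so both implementations here are assumptions, as is the word pattern: runs of word characters that may be joined by a straight apostrophe or a hyphen. Curly apostrophes are not handled.

```python
import re
from collections import Counter

# Assumed pattern: word characters, optionally joined by internal
# apostrophes or hyphens (e.g. "i'd", "well-known").
WORD_RE = re.compile(r"\w+(?:['-]\w+)*")


def extract_words(text):
    """Lowercase the text and return its words without external punctuation."""
    return WORD_RE.findall(text.lower())


def count_words(text):
    """Return a dictionary mapping each word to its number of occurrences."""
    return dict(Counter(extract_words(text)))


print(extract_words("I thought I'd made a 'mistake'."))
# ['i', 'thought', "i'd", 'made', 'a', 'mistake']
```

Matching words with findall, instead of deleting punctuation, keeps the apostrophe in "i'd" while the quotes around 'mistake' and the final period are left out.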

For example, split_ngrams("I thought I'd made a 'mistake'.", 3) should return [("START", "i", "thought"), ("i", "thought", "i'd"), ("thought", "i'd", "made"), ("i'd", "made", "a"), ("made", "a", "mistake"), ("a", "mistake", "END")].

User Lazloman
by
8.4k points

1 Answer

7 votes

Answer:

import re

def split_ngrams(text, n):
    # Lowercase the sentence and split it into whitespace-separated tokens
    tokens = text.lower().split()
    # Remove word-external punctuation only, so "it's" keeps its apostrophe
    words = [re.sub(r"^\W+|\W+$", "", token) for token in tokens]
    # Drop tokens that consisted of punctuation only
    words = [word for word in words if word]
    # Add "START" and "END" to the list of words
    words = ["START"] + words + ["END"]
    # Extract all N-grams of size n from the list of words
    ngrams = []
    for i in range(len(words) - n + 1):
        ngram = tuple(words[i:i+n])
        ngrams.append(ngram)
    return ngrams

# Example usage
text = "It's raining cats and dogs!"
n = 2
print(split_ngrams(text, n))
# Output: [('START', "it's"), ("it's", 'raining'), ('raining', 'cats'), ('cats', 'and'), ('and', 'dogs'), ('dogs', 'END')]
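One detail worth checking against the task's expected output is how punctuation gets removed. The short comparison below is only a sketch with illustrative variable names: it contrasts deleting every punctuation character with stripping punctuation from the edges of each token.

```python
import re

sentence = "it's raining cats and dogs!"

# Deleting every non-word, non-space character also removes the apostrophe.
delete_all = re.sub(r"[^\w\s]+", "", sentence).split()

# Stripping only leading/trailing non-word characters from each token keeps it.
strip_edges = [re.sub(r"^\W+|\W+$", "", token) for token in sentence.split()]

print(delete_all)   # ['its', 'raining', 'cats', 'and', 'dogs']
print(strip_edges)  # ["it's", 'raining', 'cats', 'and', 'dogs']
```

Only the second result matches the bigrams given in the question, which list "it's" with its apostrophe.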

User Sahaquiel
by
7.4k points
