Write a function ngrams(n, tokens) that produces a list of all n-grams of the specified size from the input token list. Each n-gram should consist of a 2-element tuple (context, token), where the context is itself an (n-1)-element tuple comprised of the n-1 words preceding the current token. The sentence should be padded with n-1 "" tokens at the beginning and a single "" token at the end. If n = 1, all contexts should be empty tuples. You may assume that n ≥ 1.

>>> ngrams(1, 'abc')
[('~', 'a'), ('a', 'b'), ('b', 'c')]
>>> ngrams(2, 'abc')
[('~~', 'a'), ('~a', 'b'), ('ab', 'c')]

User Anayansi
by
4.8k points

1 Answer

Answer:

Step-by-step explanation:

Assuming the input is a string of space-separated words, such as x = "a b c d", we can use the following function:

def ngrams(text, n):
    # Split on single spaces, then slide a window of n words across the list.
    words = text.split(' ')
    output = []
    for i in range(len(words) - n + 1):
        output.append(words[i:i + n])
    return output

ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]

If you want those joined back into strings, you might call something like:

[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']

Lastly, this returns every occurrence rather than totals, so if your input is 'a a a a', you need to count the n-grams up in a dict:

text = 'a a a a'
grams = {}  # maps each joined n-gram to its number of occurrences
for g in (' '.join(x) for x in ngrams(text, 2)):
    grams.setdefault(g, 0)
    grams[g] += 1
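As an alternative sketch, the standard library's collections.Counter can do the same tallying without the setdefault step. The helper name count_ngrams is made up for illustration, and it assumes the same space-separated input as above:

```python
from collections import Counter


def count_ngrams(text, n):
    # Hypothetical helper: split on single spaces, as in the function above.
    words = text.split(' ')
    # Counter tallies each joined window of n consecutive words.
    return Counter(' '.join(words[i:i + n]) for i in range(len(words) - n + 1))


print(count_ngrams('a a a a', 2))  # Counter({'a a': 3})
```

A Counter behaves like a dict, so it can be used wherever the plain dict of totals would be.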

Putting it all together:

def ngrams(text, n):
    words = text.split(' ')
    output = {}
    for i in range(len(words) - n + 1):
        # Join each window back into a string and use it as the dict key.
        g = ' '.join(words[i:i + n])
        output.setdefault(g, 0)
        output[g] += 1
    return output

ngrams('a a a a', 2) # {'a a': 3}
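Note that the question asks for something slightly different: ngrams(n, tokens) taking a token list and returning (context, token) tuples with padding. A minimal sketch of that variant is below. The padding strings are not spelled out in the question (they appear as empty quotes), so "<START>" and "<END>" here are placeholder assumptions, and the name padded_ngrams is hypothetical, chosen only to avoid clashing with the function above:

```python
def padded_ngrams(n, tokens, start="<START>", end="<END>"):
    """Return (context, token) pairs, where context holds the n-1 preceding tokens.

    The default start/end markers are assumptions; swap in whatever the
    assignment actually specifies.
    """
    # Pad with n-1 start markers in front and a single end marker at the back.
    padded = [start] * (n - 1) + list(tokens) + [end]
    # Every position from n-1 onward is a "current token" with a full context.
    return [
        (tuple(padded[i - n + 1:i]), padded[i])
        for i in range(n - 1, len(padded))
    ]


print(padded_ngrams(1, ["a", "b", "c"]))
# [((), 'a'), ((), 'b'), ((), 'c'), ((), '<END>')]
print(padded_ngrams(2, ["a", "b", "c"]))
# [(('<START>',), 'a'), (('a',), 'b'), (('b',), 'c'), (('c',), '<END>')]
```

With n = 1 the slice is empty, so every context is the empty tuple, as the question requires.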

User Chuck Lantz
by
4.1k points