When an author produces an index for his or her book, the first step in the process is to decide which words should go into the index; the second is to produce a list of the pages where each word occurs. Instead of trying to choose words out of our heads, we decided to let the computer produce a list of all the unique words used in the manuscript and their frequency of occurrence. We could then go over the list and choose which words to put into the index.

The main object in this problem is a "word" with an associated frequency. The tentative definition of a "word" here is a string of alphanumeric characters between markers, where the markers are white space and punctuation marks; any non-alphanumeric character ends the word. If we skip all disallowed characters before reading the string, we should have exactly what we want. Ignoring words of fewer than three letters removes from consideration words such as "a", "is", "to", "do", and "by" that do not belong in an index.

In this project, you are asked to write a program that reads any text file and then lists all the "words" in alphabetical order together with the frequency with which each appears in the text. A "word" is as defined above and has at least three letters.
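For illustration, one way to pick out tokens that match this definition is to scan for runs of alphanumeric characters (a minimal sketch, assuming Python and its re module; the answer below takes a slightly different route, stripping punctuation and splitting on white space):

import re

def words_of(line):
    # Alphanumeric runs are candidate "words"; anything else is a marker.
    # Keep only tokens of at least three characters.
    return [t.lower() for t in re.findall(r'[A-Za-z0-9]+', line) if len(t) >= 3]

# words_of("Go to the index, by and by.") -> ['the', 'index', 'and']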

User Danbanica

1 Answer

4 votes

Answer:

import string

dic = {}

book = open("book.txt", "r")

# Iterate over each line in the book
for line in book.readlines():
    tex = line
    # Convert to lower case so "Word" and "word" are counted as the same entry
    tex = tex.lower()
    # Remove all punctuation characters from the line
    tex = tex.translate(str.maketrans('', '', string.punctuation))
    # Split the cleaned line into a list of words
    new = tex.split()
    for word in new:
        # Only count words of at least three letters
        if len(word) > 2:
            if word not in dic.keys():
                dic[word] = 1
            else:
                dic[word] = dic[word] + 1

# Print the words in alphabetical order with their frequencies
for word in sorted(dic):
    print(word, dic[word])

book.close()

Step-by-step explanation:

The code above is written in Python 3.

import string

First, it is important to import all the modules that you will need. The string module is imported so that we can use its string.punctuation constant when stripping punctuation from each line.
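For reference, string.punctuation is simply a constant string holding the ASCII punctuation characters:

import string

print(string.punctuation)   # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~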

dic = {}

book = open("book.txt", "r")

# Iterate over each line in the book
for line in book.readlines():
    tex = line
    tex = tex.lower()
    tex = tex.translate(str.maketrans('', '', string.punctuation))
    new = tex.split()

An empty dictionary is then created. A dictionary is needed to store each word together with its number of occurrences, with the word as the key and the count as the value, in a word : occurrence format.

Next, the file you want to read from is opened and the code iterates over each line. Each line is converted to lower case, punctuation and special characters are removed from it, and it is split into a list of words that can be iterated over.
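As a small illustration (the sample line below is made up), a single line passes through these steps like this:

import string

line = "The Index, the index -- and THE index!"
tex = line.lower()                                              # 'the index, the index -- and the index!'
tex = tex.translate(str.maketrans('', '', string.punctuation))  # 'the index the index  and the index'
new = tex.split()                                               # ['the', 'index', 'the', 'index', 'and', 'the', 'index']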

for word in new:
    if len(word) > 2:
        if word not in dic.keys():
            dic[word] = 1
        else:
            dic[word] = dic[word] + 1

For every word in the new list: if the length of the word is greater than 2 and the word is not already in the dictionary, add the word to the dictionary with a value of 1.

If the word is already in the dictionary, increase its value by 1.
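As a side note, the same count-or-start-at-one logic is often written more compactly with dict.get, which returns 0 for a word that has not been seen yet; this is an equivalent alternative, not what the code above uses:

for word in new:
    if len(word) > 2:
        # dic.get(word, 0) is 0 the first time a word appears
        dic[word] = dic.get(word, 0) + 1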

for word in sorted(dic):
    print(word, dic[word])

book.close()

The dictionary keys (words) are sorted alphabetically and printed out together with their counts. Finally, the file is closed.
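As an aside, the file could also be opened with a with statement, which closes it automatically even if an error occurs part-way through; a sketch of the same structure, not the answer's exact code:

with open("book.txt", "r") as book:
    for line in book.readlines():
        # same per-line processing as above
        ...
# no explicit book.close() is needed; the with block closes the file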

Check the attachment to see the code in action.

User IhtkaS