We want to be able to carry out an analysis of words in long documents to find the most frequently used words. This can be used, for example, to identify the most important words for language learning, or to try to identify authors of literary works. Later on we will ask you to analyse two Shakespeare plays, Hamlet and The Merchant of Venice, to find the 20 most frequent words and the number of times each word occurs. Because the most common words are mainly stop words (articles, prepositions, etc.) and the plays' characters (e.g. Hamlet, Horatio, Portia, etc.), we will also want the ability to exclude certain words from the analysis.

First we want to explore the problem in a more general, abstract form. Given the name (string) of a text file containing words, a positive integer m, and a text file containing excluded words (strings), find the m most frequent words in the file (apart from the excluded words) and their frequencies, given in descending order of frequency. We define this more formally as follows:

Operation: Most common in file
Inputs: filename, a string; excluded-words-filename, a string; m, an integer
Preconditions: Files named filename and excluded-words-filename are text files; m > 0
Outputs: most-common-words, a list of at most m items, where each item is a tuple
Postconditions: Each item of most-common-words is a tuple containing a word from the file filename together with its frequency, with the list in descending order of frequency, and no tuples for words from excluded-words-filename in the list. The frequency component of each tuple in most-common-words is greater than or equal to the frequency of occurrence of any other word in filename, ignoring any words in excluded-words-filename.

Q1(d)(i) (3 marks) The main ADT to use for storing the text should be a bag. You will also need to choose a suitable ADT for the excluded words, and you can also use other standard simple built-in data structures of Python, such as lists or strings, if necessary. State what sort of ADTs and data structures you would use for this problem and explain what is stored in these ADTs. Do not explain your choices at this stage; we will ask about that later.

Q1(d)(ii) (3 marks) Give a step-by-step explanation showing how your solution would work. You can make use of any of the bag ADT operations listed in Chapter 8.

Q1(d)(iii) (5 marks) Now explain your chosen approach by outlining the characteristics and the expected performance of the operations on bags and other ADTs/data structures you have used, in standard Python implementations. You should reference the performance discussions for bags in Chapter 8 and relevant performance discussions elsewhere in the module text for other ADTs/data structures.

1 Answer


Final answer:

The solution uses a bag to store the words, a set for the excluded words, and a list of (word, frequency) tuples. We count each word's occurrences, drop the excluded words, sort the list in descending order of frequency, and take the first m entries.

Step-by-step explanation:

To address the problem of analyzing text documents for word frequency, we would use several Abstract Data Types (ADTs) and built-in Python data structures:

  • A bag (multiset) to store all words from the input file, allowing easy addition and frequency counting (a minimal sketch of such a bag follows this list).
  • A set for the excluded words to allow for quick lookup and exclusion.
  • A list to hold the tuples of words and their counts, which can be sorted based on frequency.
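As a concrete illustration of what these structures hold, here is a minimal sketch of a bag built on Python's collections.Counter. The operation names (add, count, distinct_items, size) are assumptions chosen for this example; the Chapter 8 bag ADT may name or implement its operations differently.

```python
from collections import Counter

class Bag:
    """A minimal multiset (bag) wrapper around collections.Counter."""

    def __init__(self):
        self._counts = Counter()

    def add(self, item):
        """Add one occurrence of item to the bag."""
        self._counts[item] += 1

    def count(self, item):
        """Return how many times item occurs in the bag."""
        return self._counts[item]

    def distinct_items(self):
        """Return the distinct items in the bag (without duplicates)."""
        return self._counts.keys()

    def size(self):
        """Return the total number of items, counting duplicates."""
        return sum(self._counts.values())
```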

Step-by-Step Solution:

  1. Read the words from the text file and add them to a bag; each word is an element, and multiple instances are allowed.
  2. Load the excluded words from the specified file into a set for O(1) lookup times.
  3. Iterate over the distinct words in the bag, use the bag's count operation to obtain each word's frequency, and store each word with its frequency as a tuple in a list.
  4. Skip (or remove) any tuples whose word appears in the excluded-words set.
  5. Sort the list of tuples on the second element (the frequency) in descending order.
  6. Select the top m words by slicing the sorted list (a Python sketch of these steps is given below).
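The following sketch puts the six steps together, assuming the Bag class shown earlier. The function name most_common_in_file, the lower-casing, and the simple punctuation stripping are illustrative assumptions, not part of the formal specification.

```python
def most_common_in_file(filename, excluded_words_filename, m):
    """Return up to m (word, frequency) tuples in descending order of
    frequency, ignoring any words from the excluded-words file."""
    # Step 1: read the words from the text file into a bag.
    bag = Bag()
    with open(filename) as infile:
        for line in infile:
            for word in line.lower().split():
                bag.add(word.strip('.,;:!?"\''))

    # Step 2: load the excluded words into a set for fast membership tests.
    with open(excluded_words_filename) as infile:
        excluded = {line.strip().lower() for line in infile}

    # Steps 3-4: build (word, frequency) tuples, skipping excluded words.
    counts = [(word, bag.count(word))
              for word in bag.distinct_items()
              if word not in excluded]

    # Step 5: sort on the frequency component, highest first.
    counts.sort(key=lambda pair: pair[1], reverse=True)

    # Step 6: slice off the first m tuples.
    return counts[:m]
```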

Performance Explanation:

The bag ADT is chosen because it handles duplicates naturally, which fits the requirements of word-frequency analysis. In a dictionary-backed (hash-table) Python implementation, adding a word to a bag is O(1) on average, and looking up a word's count is likewise O(1) on average. A set is used for the excluded words because it supports O(1) average-case membership testing, which keeps each exclusion check fast. Sorting the list of (word, frequency) tuples is O(n log n), where n is the number of distinct words after exclusion, and slicing the first m entries is O(m). A short illustration of the difference between set and list membership testing is given below.
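To see why a set is preferred over a list for the excluded words, the following quick (and machine-dependent) comparison contrasts O(n) list membership testing with O(1) average-case set lookups; the word lists here are made up purely for the illustration.

```python
import timeit

# Hypothetical excluded words, generated just for timing purposes.
excluded_list = ["word" + str(i) for i in range(10_000)]
excluded_set = set(excluded_list)

# Looking up a word near the end of the list forces a near-full O(n) scan,
# whereas the set lookup is a single O(1) average-case hash probe.
list_time = timeit.timeit("'word9999' in excluded_list",
                          globals=globals(), number=1_000)
set_time = timeit.timeit("'word9999' in excluded_set",
                         globals=globals(), number=1_000)
print(f"list lookup: {list_time:.4f}s  set lookup: {set_time:.4f}s")
```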

by User STg