Natural language processing (NLP) is a field of artificial intelligence that seeks to develop the ability of a computer program to understand human language. Usually, the first step of an NLP system is to convert words into numeric codes. Thus, the system converts an input text into a sequence of numeric codes before any high-level analysis. This process is known as text preprocessing.

We can only perform text preprocessing if we have a vocabulary of words and their associated numeric codes. Your task is to create a vocabulary of unique words for a given text file and assign a different number from 1 to N to each unique word, with N being the total number of unique words. You must perform this assignment so that the first word in alphabetical order gets the number 1, the second word in alphabetical order gets the number 2, and so on.
A word is a sequence of letters (uppercase or lowercase). The file is composed of letters and white spaces (spaces, tabs, newlines). White spaces serve as word separators and cannot be part of any word. A file can have multiple consecutive separators. Different case variations of the same word (The, the, and THE) must be considered the same. All vocabulary words must contain uppercase letters only.
Your program will receive two command-line arguments, the name of the input text file and the name of the file where the vocabulary must be saved. Example:
$ ./a.out inputX.txt vocabularyX.txt
Each line of the output file must contain a number (the numeric code) and a word (a unique word) separated by a space, and the words must be in alphabetical order. Below are some examples of input and expected output.
Examples (your program must follow this format precisely)
Example #1
input0.txt
the THE The ha Ha HA
vocabulary0.txt
1 HA
2 THE
Example #2
input1.txt
Lorem ipsum dolor sit amet consectetur adipiscing elit
Ut commodo nec magna et sodales
vocabulary1.txt
1 ADIPISCING
2 AMET
3 COMMODO
4 CONSECTETUR
5 DOLOR
6 ELIT
7 ET
8 IPSUM
9 LOREM
10 MAGNA
11 NEC
12 SIT
13 SODALES
14 UT

1 Answer

Final answer:

To create the vocabulary, read the input file word by word, convert each word to uppercase, and keep only the unique words. Sort them alphabetically, number them from 1 to N (N being the total number of unique words), and write one code and one word per line, separated by a space, to the output file.

Step-by-step explanation:

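Building the vocabulary is what makes text preprocessing possible: words can only be converted into numeric codes once every word has a code. First, read the input file (the first command-line argument) and split it into words, using white space (spaces, tabs, newlines) as the separator. Consecutive separators must not produce empty words. Second, convert every word to uppercase so that case variations such as The, the, and THE count as the same word. Third, discard duplicates and sort the remaining words alphabetically. Finally, assign 1 to the first word, 2 to the second, and so on up to N, where N is the total number of unique words, and write each code and its word, separated by a space, on its own line of the output file (the second command-line argument).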
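
The reading, uppercasing, and de-duplicating steps can be sketched in C++ as follows. This is only one possible approach, not the required solution: the function names build_vocabulary and build_vocabulary_from_text are hypothetical, and the sketch assumes the input contains only ASCII letters and white space, as the problem statement says. It relies on std::set, which stores each string once and keeps its elements in sorted order.

```cpp
#include <cctype>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: reads whitespace-separated words from `in`,
// converts each one to uppercase, and returns the unique words.
// std::set keeps its elements sorted, which for uppercase ASCII
// letters is alphabetical order.
std::set<std::string> build_vocabulary(std::istream& in) {
    std::set<std::string> vocab;
    std::string word;
    // operator>> skips any run of spaces, tabs, and newlines,
    // so consecutive separators never yield an empty word.
    while (in >> word) {
        for (char& c : word) {
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        }
        vocab.insert(word);  // duplicates are ignored by the set
    }
    return vocab;
}

// Convenience wrapper (also hypothetical) for trying the function
// on a string instead of a file.
std::set<std::string> build_vocabulary_from_text(const std::string& text) {
    std::istringstream in(text);
    return build_vocabulary(in);
}
```

In a complete program, main would open the file named by argv[1] with std::ifstream and pass that stream to build_vocabulary.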

Here is an example of how the vocabulary file should look:

1 HA
2 THE
...
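
Writing the file in this format can be sketched in C++ as below, assuming the unique uppercase words are already held in a std::set<std::string> (which iterates in sorted order). The function name write_vocabulary is hypothetical, chosen only for illustration.

```cpp
#include <cstddef>
#include <ostream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: writes one "code word" pair per line.
// Codes start at 1 and follow the set's sorted iteration order.
void write_vocabulary(std::ostream& out, const std::set<std::string>& vocab) {
    std::size_t code = 1;
    for (const std::string& word : vocab) {
        out << code << ' ' << word << '\n';
        ++code;
    }
}

// Convenience wrapper (also hypothetical) that returns the
// formatted vocabulary as a string, which is handy for checking.
std::string format_vocabulary(const std::set<std::string>& vocab) {
    std::ostringstream out;
    write_vocabulary(out, vocab);
    return out.str();
}
```

In a complete program, main would open the file named by argv[2] with std::ofstream and pass that stream to write_vocabulary.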

Answered by user Jin Lee (8.1k points)