2 Multiclass Naive Bayes with Bag of Words

A group of artists wishes to use the Naive Bayes algorithm to classify a given artwork into three categories given a text description of the painting. These descriptions are short and simple and have already been passed through a feature function, which returned key features based on the counts of certain words used to describe the painting. The three categories relate to the overall color scheme: Warm, Cool, and Neutral. A set of these descriptions has been sampled, and each was classified by the artists based on its feature vector. The data collected so far is given in the table below:

a. (1 pt) What is the probability $\theta_y$ of each label $y \in \{\text{Warm}, \text{Neutral}, \text{Cool}\}$?
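Since the data table is not reproduced here, the following Python sketch illustrates the maximum-likelihood estimate $\theta_y = \operatorname{count}(y)/N$ on a made-up sample; the `labels` list is purely hypothetical.

```python
# Sketch of the MLE prior theta_y = count(y) / N, on hypothetical data.
from collections import Counter

# Hypothetical labeled sample: one label per artwork description.
labels = ["Warm", "Warm", "Cool", "Neutral", "Cool", "Warm"]

counts = Counter(labels)
N = len(labels)
theta = {y: counts[y] / N for y in ("Warm", "Neutral", "Cool")}
print(theta)  # {'Warm': 0.5, 'Neutral': 0.166..., 'Cool': 0.333...}
```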
b. (3 pts) The parameter $\phi_{y,j}$ is the probability of a token $j$ appearing with label $y$. It is defined by the following equation, where $V$ is the size of the vocabulary set:

$$\phi_{y,j} = \frac{\operatorname{count}(y,j)}{\sum_{j'=1}^{V} \operatorname{count}(y,j')}$$
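As a concrete illustration of this estimate, the sketch below computes $\phi_{y,j}$ from a made-up per-label count table over a vocabulary of $V = 8$ color words; in the actual problem these counts would come from the table above.

```python
# Sketch of the MLE phi_{y,j} = count(y,j) / sum_j' count(y,j'),
# using a hypothetical count table: count[y][j] is the number of
# times word j appears across all descriptions labeled y.
count = {
    "Warm":    [3, 0, 2, 1, 0, 1, 0, 1],
    "Neutral": [0, 2, 0, 1, 1, 0, 2, 0],
    "Cool":    [1, 1, 0, 1, 2, 0, 1, 2],
}

phi = {y: [c / sum(row) for c in row] for y, row in count.items()}
print(phi["Warm"])  # each row of phi sums to 1
```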
The probability of a word count vector $x$ and a label $y$ is defined as follows. Here, $\operatorname{count}(y,j)$ represents the frequency of word $j$ appearing with label $y$ over all data points.

$$p(x, y; \theta, \phi) = p(y; \theta) \cdot p(x \mid y; \phi) = p(y; \theta) \prod_{j=1}^{V} \phi_{y,j}^{x_j}$$
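To make the product concrete, here is a small sketch evaluating $p(x, y; \theta, \phi)$ for a single label; the $\theta_y$ and $\phi_{y,j}$ values are hypothetical placeholders, not ones implied by the table.

```python
# Sketch: evaluate p(x, y) = theta_y * prod_j phi_{y,j}^{x_j} for one label.
theta_y = 1/3                                  # hypothetical prior p(y)
phi_y = [1/8, 1/8, 0, 1/8, 2/8, 0, 1/8, 2/8]   # hypothetical phi_{y,j}, j = 1..V
x = (0, 1, 0, 1, 1, 0, 0, 1)                   # word count vector

p = theta_y
for j, x_j in enumerate(x):
    p *= phi_y[j] ** x_j   # 0**0 == 1 in Python, so x_j = 0 terms drop out
print(p)                   # 1/3 * (1/8) * (1/8) * (2/8) * (2/8) ≈ 0.000326
```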
Here, the words are the names of colors that appear in the text description of the artwork, and a word count vector indicates the occurrences of each of these words in the text description of a given artwork. Find the most likely label $\hat{y}$ for the following word count vector $x = (0, 1, 0, 1, 1, 0, 0, 1)$ using $\hat{y} = \operatorname{argmax}_y \log p(x, y; \theta, \phi)$. Show the final log (base-10) probabilities for each label, rounded to 3 decimals. Treat $\log(0)$ as $-\infty$. (Hint: read more about binary multinomial naive Bayes in Jurafsky and Martin, Chapter 4, as well as Hiroshi Shimodaira's note: https://www.inf.ed.ac.uk/teaching/courses/inf2b/learnnotes/inf2b-learn-note07-2up.pdf.)
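A minimal sketch of the decision rule, under the same hypothetical parameters as the earlier sketches: it scores each label in base-10 logs, treats $\log(0)$ as $-\infty$ as instructed, and rounds to 3 decimals.

```python
import math

# Hypothetical parameters (the real ones come from the table above).
theta = {"Warm": 3/6, "Neutral": 1/6, "Cool": 2/6}
phi = {
    "Warm":    [3/8, 0,   2/8, 1/8, 0,   1/8, 0,   1/8],
    "Neutral": [0,   2/6, 0,   1/6, 1/6, 0,   2/6, 0],
    "Cool":    [1/8, 1/8, 0,   1/8, 2/8, 0,   1/8, 2/8],
}

def log_joint(x, y):
    """log10 p(x, y; theta, phi), treating log10(0) as -infinity."""
    score = math.log10(theta[y])
    for j, x_j in enumerate(x):
        if x_j == 0:
            continue
        if phi[y][j] == 0:
            return float("-inf")  # one zero parameter eliminates label y
        score += x_j * math.log10(phi[y][j])
    return score

x = (0, 1, 0, 1, 1, 0, 0, 1)
scores = {y: round(log_joint(x, y), 3) for y in theta}
print(scores)                       # Warm and Neutral collapse to -inf here
print(max(scores, key=scores.get))  # 'Cool' on this hypothetical data
```

Note that with these made-up counts, two of the three labels are eliminated outright by zero-valued parameters, which is exactly the failure mode part (c) addresses.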
c. (3 pts) When calculating $\operatorname{argmax}_y \log p(x, y; \theta, \phi)$, if $\phi_{y,j} = 0$ for a label-word pair, the label $y$ is no longer considered. This is an issue especially for smaller datasets, where a feature may not appear in any document for a certain label. One approach to mitigating this high variance is to smooth the probabilities. Using add-1 smoothing, which redefines $\phi_{y,j}$ as below, again find the most likely label $\hat{y}$ for the word count vector $x = (0, 1, 0, 1, 1, 0, 0, 1)$ using $\hat{y} = \operatorname{argmax}_y \log p(x, y; \theta, \phi)$. Make sure to show the final log probabilities.
Add-1 smoothing:

$$\phi_{y,j} = \frac{1 + \operatorname{count}(y,j)}{V + \sum_{j'=1}^{V} \operatorname{count}(y,j')}$$
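The same sketch with add-1 smoothing applied to the hypothetical counts from before: every $\phi_{y,j}$ is now strictly positive, so no label can be eliminated by a single unseen word.

```python
import math

# Hypothetical per-label word counts over V = 8 words, as before.
count = {
    "Warm":    [3, 0, 2, 1, 0, 1, 0, 1],
    "Neutral": [0, 2, 0, 1, 1, 0, 2, 0],
    "Cool":    [1, 1, 0, 1, 2, 0, 1, 2],
}
theta = {"Warm": 3/6, "Neutral": 1/6, "Cool": 2/6}
V = 8

# Add-1 smoothing: phi_{y,j} = (1 + count(y,j)) / (V + sum_j' count(y,j')).
phi = {y: [(1 + c) / (V + sum(row)) for c in row] for y, row in count.items()}

def log_joint(x, y):
    """log10 p(x, y; theta, phi); no zeros remain after smoothing."""
    return math.log10(theta[y]) + sum(
        x_j * math.log10(phi[y][j]) for j, x_j in enumerate(x) if x_j
    )

x = (0, 1, 0, 1, 1, 0, 0, 1)
scores = {y: round(log_joint(x, y), 3) for y in theta}
print(scores, max(scores, key=scores.get))
```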