Final answer:
In order to satisfy the condition ∣wσ′(wa+b)∣≥1, we must have ∣w∣≥4. The set of input activations that satisfy this condition can range over an interval no greater in width than (2/∣w∣) ln(∣w∣(1+√(1−4/∣w∣))/2 − 1). Numerically, this expression is greatest at ∣w∣≈6.9, where it takes a value ≈0.45.
Step-by-step explanation:
To start, recall that σ is the sigmoid function, σ(z) = 1/(1+e^(−z)), whose derivative is σ′(z) = σ(z)(1−σ(z)). Since σ(z) lies strictly between 0 and 1, the product σ(z)(1−σ(z)) is at most 1/4, and it reaches 1/4 only at z = 0. Therefore ∣wσ′(wa+b)∣ = ∣w∣σ′(wa+b) ≤ ∣w∣/4 for every a and b.
For ∣wσ′(wa+b)∣≥1 to hold, we therefore need ∣w∣/4 ≥ 1, that is, ∣w∣≥4. If ∣w∣<4, no choice of a or b can satisfy the condition.
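The bound ∣w∣≥4 rests on the sigmoid derivative never exceeding 1/4. A minimal numerical sketch of that fact, assuming σ is the standard logistic sigmoid (the function names and the grid over z are illustrative choices, not part of the original answer):

```python
import math

def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Scan z over [-10, 10] in steps of 0.01 and locate the largest derivative.
zs = [k / 100.0 - 10.0 for k in range(2001)]
best_z = max(zs, key=sigmoid_prime)
max_value = sigmoid_prime(best_z)

print(best_z, max_value)
```

Because the peak is 1/4 at z = 0, the product ∣w∣σ′(z) can only reach 1 when ∣w∣ is at least 4.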
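Now, let's move on to the second part of the question. Suppose ∣w∣≥4 and write z = wa+b. The condition ∣wσ′(z)∣≥1 is equivalent to σ′(z) ≥ 1/∣w∣. Writing s = σ(z), so that σ′(z) = s(1−s), this becomes s^2 − s + 1/∣w∣ ≤ 0, which holds exactly when s lies between the two roots (1 ± √(1−4/∣w∣))/2. Inverting the sigmoid with z = ln(s/(1−s)), and using s(1−s) = 1/∣w∣ at the roots, the upper root corresponds to z_max = ln(∣w∣(1+√(1−4/∣w∣))/2 − 1), and by symmetry the lower root corresponds to −z_max. So the condition holds exactly when ∣wa+b∣ ≤ z_max, an interval of width 2 z_max in z. Since a = (z−b)/w, this is an interval of width (2/∣w∣) ln(∣w∣(1+√(1−4/∣w∣))/2 − 1) in a. The set of activations satisfying the condition can therefore be no wider than this, and it is narrower if part of that interval falls outside the values a can actually take.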
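As a sanity check on the interval width (2/∣w∣) ln(∣w∣(1+√(1−4/∣w∣))/2 − 1), the sketch below measures the set of a directly by brute force and compares it with the closed form. The values w = 8 and b = −4 are assumptions chosen for illustration so that the interval is centred at a = 0.5 and lies fully inside (0, 1).

```python
import math

def sigmoid_prime(z):
    """Derivative of the logistic sigmoid."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def closed_form_width(w_abs):
    """Closed-form width of the interval of a where |w * sigma'(w*a + b)| >= 1."""
    root = math.sqrt(1.0 - 4.0 / w_abs)
    return (2.0 / w_abs) * math.log(w_abs * (1.0 + root) / 2.0 - 1.0)

# Illustrative (assumed) parameters: the interval is centred at a = -b/w = 0.5.
w, b = 8.0, -4.0
step = 1e-5

# Count grid points in [0, 1] that satisfy the condition, then convert to a length.
hits = sum(
    1
    for k in range(int(1 / step) + 1)
    if abs(w * sigmoid_prime(w * (k * step) + b)) >= 1.0
)
measured_width = hits * step
expected_width = closed_form_width(abs(w))

print(measured_width, expected_width)
```

The two numbers should agree up to the grid resolution; a finer step tightens the match.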
To numerically check the expression for the maximum width, we can evaluate (2/∣w∣) ln(∣w∣(1+√(1−4/∣w∣))/2 − 1) over a range of ∣w∣≥4. The expression is 0 at ∣w∣=4, tends to 0 again as ∣w∣ grows large, and peaks at ∣w∣≈6.9, where it is ≈0.45. This means that even with perfect alignment, we have a fairly narrow range of input activations which can avoid the vanishing gradient problem.
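The location and size of the peak can be reproduced with a simple grid search. This is a rough sketch rather than a careful optimisation: the function name, the upper limit of 20, and the step of 0.001 are arbitrary assumptions.

```python
import math

def width(w_abs):
    """Width of the interval of a where |w * sigma'(w*a + b)| >= 1, for |w| >= 4."""
    root = math.sqrt(1.0 - 4.0 / w_abs)
    return (2.0 / w_abs) * math.log(w_abs * (1.0 + root) / 2.0 - 1.0)

# Evaluate the width on a grid of |w| values from 4 to 20.
grid = [4.0 + 0.001 * k for k in range(16001)]
best_w = max(grid, key=width)
best_width = width(best_w)

print(round(best_w, 2), round(best_width, 3))
```

The search should land near ∣w∣≈6.9 with a width of about 0.45, matching the figures quoted above.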