The log-sum-exp trick is rarely taught explicitly, but in machine learning you quite often come across problems that contain the following quantity, and knowing the trick helps a lot. Say we have an n-dimensional vector x and want to calculate:
$$ y = \log \sum_{i=1}^{n} e^{x_i} $$
One example is a filtering problem, in which the posterior is calculated recursively:
$$ p(h_t \mid v_{1:t}) \equiv \alpha(h_t) = p(v_t \mid h_t) \sum_{h_{t-1}} \alpha(h_{t-1}) \, p(h_t \mid h_{t-1}) $$
As α can become very small, it is common to run the recursion in log-space:
$$ \log \alpha(h_t) = \log p(v_t \mid h_t) + \log \sum_{h_{t-1}} \exp\big( \log \alpha(h_{t-1}) + \log p(h_t \mid h_{t-1}) \big) $$
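One step of this log-space recursion could be sketched as follows. This is a minimal sketch, assuming a discrete hidden state with K values and NumPy arrays; the function name `log_filter_step` and its argument names are hypothetical. It leans on NumPy's `np.logaddexp` ufunc, whose `reduce` evaluates the log of a sum of exponentials over one axis.

```python
import numpy as np

def log_filter_step(log_alpha_prev, log_trans, log_emit):
    """One step of the filtering recursion, entirely in log-space.

    log_alpha_prev[j] = log alpha(h_{t-1} = j)
    log_trans[j, i]   = log p(h_t = i | h_{t-1} = j)
    log_emit[i]       = log p(v_t | h_t = i)
    """
    # scores[j, i] = log alpha(h_{t-1} = j) + log p(h_t = i | h_{t-1} = j)
    scores = log_alpha_prev[:, None] + log_trans
    # log of the sum over the previous state j, for every current state i
    log_sum = np.logaddexp.reduce(scores, axis=0)
    return log_emit + log_sum
```

For moderately sized probabilities the result agrees with the plain recursion `emit * (alpha_prev @ trans)` after exponentiating, which makes for an easy sanity check.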
Another example is a multinomial distribution that you want to parameterize with a softmax, as in logistic regression with more than two unordered categories. If you calculate the log-likelihood, the quantity appears through the normalization constant.
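To make the softmax case concrete, here is a short worked equation. The symbols are assumptions for illustration: K is the number of categories and $z_1, \dots, z_K$ are the unnormalized scores of one observation.

$$ p(y = k \mid z) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \quad\Rightarrow\quad \log p(y = k \mid z) = z_k - \log \sum_{j=1}^{K} e^{z_j} $$

The second term of the log-likelihood is exactly the quantity y from the beginning, with the scores z in place of x.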
Both problems have in common that a naive calculation quickly runs into underflow or overflow, depending on the scale of the $x_i$. Even if you work in log-space, the terms still have to be exponentiated inside the sum, where the limited range of floating-point numbers is not enough, and the result will be INF or -INF. So what can we do?
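The failure is easy to reproduce. The following is a small sketch, assuming NumPy and 64-bit floats; `naive_logsumexp` is a hypothetical helper that evaluates the formula literally.

```python
import numpy as np

def naive_logsumexp(x):
    """Evaluate log(sum(exp(x))) literally, without any rescaling."""
    x = np.asarray(x, dtype=np.float64)
    # Suppress NumPy's overflow and divide-by-zero warnings;
    # the failure still shows up in the returned value.
    with np.errstate(over="ignore", divide="ignore"):
        return np.log(np.sum(np.exp(x)))

# exp(1000) overflows to inf, so the result is inf
# (the exact answer is 1000 + log 2).
large = naive_logsumexp([1000.0, 1000.0])

# exp(-1000) underflows to 0, so the result is log(0) = -inf
# (the exact answer is -1000 + log 2).
small = naive_logsumexp([-1000.0, -1000.0])
```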
We can show that the following equation holds:
$$ \log \sum_{i=1}^{n} e^{x_i} = a + \log \sum_{i=1}^{n} e^{x_i - a} $$
This holds for an arbitrary a, which means you can shift the center of the exponential sum. A typical choice is to set a to the maximum, which forces the greatest exponent to be zero, so even if the other terms underflow, you still get a reasonable result:
$$ a = \max_i x_i $$
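In code, the shift by the maximum could look like this. It is a minimal sketch, assuming NumPy and a non-empty one-dimensional input; the early return for a non-finite maximum is an added safeguard, not part of the derivation. SciPy ships a library version as `scipy.special.logsumexp`.

```python
import numpy as np

def logsumexp(x):
    """Compute log(sum(exp(x))) by shifting with a = max(x)."""
    x = np.asarray(x, dtype=np.float64)
    a = np.max(x)
    # If the maximum is not finite (for example, all entries are -inf),
    # x - a would produce nan, so return the maximum itself.
    if not np.isfinite(a):
        return a
    # The largest exponent is now exactly 0, so the sum is at least 1
    # and the logarithm is always finite.
    return a + np.log(np.sum(np.exp(x - a)))
```

With this shift, `logsumexp([1000.0, 1000.0])` evaluates to 1000 + log 2 and `logsumexp([-1000.0, -1000.0])` to -1000 + log 2, where the direct formula would return inf and -inf.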
Proof of log-sum-exp trick
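$$
\begin{aligned}
& & y &= \log \sum_{i=1}^{n} e^{x_i} \\
&\Leftrightarrow & e^{y} &= \sum_{i=1}^{n} e^{x_i} \\
&\Leftrightarrow & e^{-a} e^{y} &= e^{-a} \sum_{i=1}^{n} e^{x_i} \\
&\Leftrightarrow & e^{y-a} &= \sum_{i=1}^{n} e^{-a} e^{x_i} \\
&\Leftrightarrow & y - a &= \log \sum_{i=1}^{n} e^{x_i - a} \\
&\Leftrightarrow & y &= a + \log \sum_{i=1}^{n} e^{x_i - a}
\end{aligned}
$$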
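The identity can also be checked numerically for a few shifts. This is a small sketch, assuming NumPy and inputs small enough that no shift is needed for stability; `shifted_logsumexp` and the test values are hypothetical.

```python
import numpy as np

def shifted_logsumexp(x, a):
    """Evaluate a + log(sum(exp(x - a))) for a caller-chosen shift a."""
    x = np.asarray(x, dtype=np.float64)
    return a + np.log(np.sum(np.exp(x - a)))

x = [0.5, -1.0, 2.0]

# a = 0 is the unshifted formula and serves as the reference value.
reference = shifted_logsumexp(x, 0.0)

# Any other shift, including the maximum, should give the same value
# up to floating-point rounding.
values = [shifted_logsumexp(x, a) for a in (-3.0, 1.0, max(x))]
```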