Entropy
MM 0412/2018
Imagine two people, Alice and Bob, living in Toronto and Boston, respectively.
Alice (Toronto) goes jogging whenever it is not snowing heavily.
Bob (Boston) doesn't ever go jogging.
Alice's actions give information about the weather in Toronto.
Bob's actions give no information. This is because Alice's actions are correlated with the weather in Toronto, whereas Bob's actions are deterministic.
How can we quantify the notion of information?
The entropy of a discrete random variable $X$ with pmf $p(x)$ is

$$H(X) = -\sum_{x} p(x)\log p(x) = -\mathbb{E}\big[\log p(X)\big]. \tag{1}$$

The entropy $H(X)$ measures the expected uncertainty in $X$. $H(X)$ also approximately refers to how much information we learn on average from one instance of the random variable $X$.
Changing the base only changes the value of the entropy by a multiplicative constant. For example,

$$H_b(X) = \log_b(a)\,H_a(X),$$

and base 2 is customarily used for the calculation of entropy, so that the entropy is measured in bits.
Example

Assume a binary random variable $X$ with $P(X = 1) = p$ and $P(X = 0) = 1 - p$. The entropy of $X$ is given as

$$H(X) = -p\log_2 p - (1 - p)\log_2(1 - p),$$

the binary entropy function. Note that the entropy only depends on the probability distribution $p(x)$, not on the values that $X$ takes.
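As a quick sanity check of (1) and the base-change relation, the Python sketch below computes the binary entropy for a few illustrative values of $p$ (the numerical values are assumed for illustration, not taken from the notes).

```python
import math

def entropy(pmf, base=2.0):
    """H(X) = -sum_x p(x) log_b p(x) for a discrete pmf given as a list of probabilities."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# Binary entropy from the example above: H(X) = -p log2 p - (1 - p) log2 (1 - p).
for p in (0.5, 0.25, 0.1):            # illustrative values of p
    print(p, entropy([p, 1 - p]))     # 1.0, ~0.811, ~0.469 bits

# Base change: H_2(X) = log_2(e) * H_e(X), i.e. entropy in bits = log2(e) * entropy in nats.
print(entropy([0.25, 0.75], base=math.e) * math.log2(math.e))  # ~0.811 bits again
```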
Joint Entropy
Consider two random variables $X$, $Y$, jointly distributed according to the pmf $p(x, y)$. The joint entropy is defined as

$$H(X, Y) = -\sum_{x, y} p(x, y)\log p(x, y). \tag{2}$$

The joint entropy measures how much uncertainty there is in the two random variables $X$ and $Y$ taken together.
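A minimal sketch of (2), computing the joint entropy from a joint pmf given as a table; the $2 \times 2$ pmf of two independent fair bits below is an assumed illustration, not part of the original notes.

```python
import math

def joint_entropy(pxy, base=2.0):
    """H(X, Y) = -sum_{x,y} p(x,y) log p(x,y) for a joint pmf given as a 2-D table."""
    return -sum(p * math.log(p, base) for row in pxy for p in row if p > 0)

# Assumed joint pmf p(x, y) over two binary variables (rows: x, columns: y).
pxy = [[0.25, 0.25],
       [0.25, 0.25]]   # two independent fair bits

print(joint_entropy(pxy))  # 2.0 bits: the uncertainty of X and Y taken together
```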
Cross entropy
The cross entropy between two probability distributions $p$ and $q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an unnatural probability distribution $q$ rather than the true distribution $p$.
The cross entropy for the distributions $p$ and $q$ over a given set is defined as

$$H(p, q) = \mathbb{E}_p\big[-\log q\big].$$
For discrete $p$ and $q$, $H(p, q)$ is expressed as

$$H(p, q) = -\sum_{x} p(x)\log q(x). \tag{3}$$

The continuous version, for densities $p$ and $q$, is

$$H(p, q) = -\int_{\mathcal{X}} p(x)\log q(x)\,dx. \tag{4}$$
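As a concrete instance of (4): for two univariate Gaussian densities $p = \mathcal{N}(\mu_p, \sigma_p^2)$ and $q = \mathcal{N}(\mu_q, \sigma_q^2)$, the cross entropy has the closed form $H(p, q) = \tfrac{1}{2}\ln(2\pi\sigma_q^2) + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2\sigma_q^2}$ (in nats). The sketch below, with assumed parameter values, checks this against a direct numerical evaluation of the integral.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def cross_entropy_gaussian(mu_p, sig_p, mu_q, sig_q):
    """Closed form of H(p, q) = -integral p(x) ln q(x) dx for two Gaussians, in nats."""
    return 0.5 * np.log(2 * np.pi * sig_q ** 2) + (sig_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sig_q ** 2)

# Assumed parameters, for illustration only.
mu_p, sig_p = 0.0, 1.0
mu_q, sig_q = 0.5, 1.5

# Direct numerical evaluation of -integral p(x) ln q(x) dx on a wide grid (Riemann sum).
x = np.linspace(-20.0, 20.0, 200_001)
dx = x[1] - x[0]
numerical = -np.sum(gaussian_pdf(x, mu_p, sig_p) * np.log(gaussian_pdf(x, mu_q, sig_q))) * dx

print(cross_entropy_gaussian(mu_p, sig_p, mu_q, sig_q), numerical)  # the two values agree closely
```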
(3) and (4) can be further computed as

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q) \tag{5}$$

and

$$H(p, q) = h(p) + D_{\mathrm{KL}}(p \parallel q), \tag{6}$$

where

$$D_{\mathrm{KL}}(p \parallel q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)} \quad\text{and}\quad D_{\mathrm{KL}}(p \parallel q) = \int_{\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}\,dx$$

are the Kullback-Leibler divergences of $q$ from $p$ for the discrete and continuous versions, respectively ($h(p)$ in (6) denotes the differential entropy of $p$).
The Kullback-Leibler divergence is always non-negative, $D_{\mathrm{KL}}(p \parallel q) \ge 0$, by Gibbs' inequality, and $D_{\mathrm{KL}}(p \parallel q) = 0$ occurs if and only if $p = q$.
From (5) and (6), the minimum value of the cross entropy is therefore $H(p)$, attained when the probability distributions $p$ and $q$ are identical.
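The decomposition (5), Gibbs' inequality, and the minimum at $q = p$ are easy to verify numerically for small pmfs; the two distributions in the sketch below are assumed purely for illustration.

```python
import math

def entropy(p, base=2.0):
    """H(p) = -sum_x p(x) log p(x)."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def cross_entropy(p, q, base=2.0):
    """H(p, q) = -sum_x p(x) log q(x)."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q, base=2.0):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

# Assumed example distributions over the same 3-symbol alphabet.
p = [0.5, 0.3, 0.2]   # "true" distribution
q = [0.4, 0.4, 0.2]   # model distribution

print(math.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q)))  # True, i.e. (5)
print(kl_divergence(p, q) >= 0)                                             # True: Gibbs' inequality
print(math.isclose(cross_entropy(p, p), entropy(p)))                        # True: minimum when q = p
```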
Application in Language Models
In language-model applications, $p$ is taken to be the true distribution of words in the training data set $T$, and $q$ is the distribution of words as predicted by the model.
The cross entropy is measured on this training data set to assess how accurately the model predicts the training data. In this case, an estimate of the cross entropy is calculated as

$$H(T, q) = -\frac{1}{K}\sum_{k=1}^{K} \log_2 q(x_k), \tag{7}$$

where $K$ is the size of the training set and $x_1, \dots, x_K$ are the words in $T$. This is a Monte Carlo estimate of the cross entropy.
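A minimal sketch of this estimate under assumed inputs: a toy training set $T$ of $K$ tokens and a hypothetical unigram model $q$ over the vocabulary (a real language model would condition $q$ on context; the unigram form is used here purely for illustration).

```python
import math
from collections import Counter

# Toy training set T and a hypothetical unigram model q (both assumed for illustration).
T = "the cat sat on the mat the cat ran".split()
K = len(T)
q = {"the": 0.3, "cat": 0.2, "sat": 0.1, "on": 0.1, "mat": 0.1, "ran": 0.2}

# Monte Carlo estimate of the cross entropy: H(T, q) ~= -(1/K) * sum_k log2 q(x_k).
H_estimate = -sum(math.log2(q[w]) for w in T) / K
print(f"cross-entropy estimate: {H_estimate:.3f} bits per word")

# For comparison, the empirical word distribution p in T and its entropy H(p):
p = {w: c / K for w, c in Counter(T).items()}
H_p = -sum(pw * math.log2(pw) for pw in p.values())
print(f"empirical entropy H(p): {H_p:.3f} bits per word  (the estimate above is >= H(p))")
```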