Entropy

MM 0412/2018


Imagine two people, Alice and Bob, living in Toronto and Boston, respectively.

Alice (Toronto) goes jogging whenever it is not snowing heavily.

Bob (Boston) doesn't ever go jogging.

Alice's actions give information about the weather in Toronto.

Bob's actions give no information. This is because Alice's actions are correlated with the weather in Toronto, whereas Bob's actions are deterministic.

How can we quantify the notion of information?

The entropy of a discrete random variable $X$ with pmf $p_X(x)$ is

$$H(X) = -\displaystyle\sum_{x_i} p(x_i) \log p(x_i) = -\textrm{E}\left[ \log p(X) \right] \pod{\text{1}}$$


The entropy measures the expected uncertainty in $X$.

$H(X)$ also indicates, roughly, how much information we learn on average from one instance of the random variable $X$.
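
As a quick sanity check of (1), here is a minimal Python sketch; the function name `entropy`, the base argument, and the example pmfs are illustrative choices, not part of the original notes.

```python
import math

def entropy(pmf, base=2):
    """H(X) = -sum_i p(x_i) * log p(x_i), equation (1)."""
    # Terms with p(x_i) = 0 contribute nothing, by the convention 0 * log 0 = 0.
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# A fair four-sided die: every outcome is equally surprising, so H(X) = log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0

# A heavily skewed distribution carries much less information per draw.
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.24 bits
```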

Changing the base of the logarithm only changes the value of the entropy by a multiplicative constant. For example,

$$H_{\textrm{bit}}(X) = -\displaystyle\sum_{i} p(x_i) \log_2 p(x_i) = \log_2(10) \left[ -\sum_{i} p(x_i) \log_{10} p(x_i) \right] = \log_2 10 \times H(X)$$

Base 2 is customarily used for the calculation of entropy, in which case the entropy is measured in bits.
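
A quick numerical check of this base-change relation, using an arbitrary illustrative pmf:

```python
import math

pmf = [0.5, 0.2, 0.2, 0.1]  # an arbitrary pmf for illustration

h_base10 = -sum(p * math.log10(p) for p in pmf)  # entropy with base-10 logs
h_base2 = -sum(p * math.log2(p) for p in pmf)    # entropy in bits

# Changing the base only rescales the entropy: H_bit(X) = log2(10) * H(X).
print(h_base2, math.log2(10) * h_base10)  # both ~1.76
```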

Example

Consider a random variable $X$ defined as

$$X = \begin{cases} 0, & \text{with probability } p \\ 1, & \text{with probability } 1-p. \end{cases}$$

The entropy of $X$ is given as

$$H(X) = -p \log p - (1-p) \log(1-p)$$

Note that the entropy depends only on the probability distribution $p$, not on the particular values that $X$ takes.
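
The binary entropy above is easy to tabulate; the sketch below (function name and sample points chosen for illustration) shows that it peaks at $p = 0.5$ and vanishes when the outcome is certain.

```python
import math

def binary_entropy(p):
    """H(X) = -p*log2(p) - (1-p)*log2(1-p), with H = 0 when p is 0 or 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p = {p:.1f}  H(X) = {binary_entropy(p):.3f} bits")
# Maximal uncertainty (1 bit) at p = 0.5; no uncertainty at p = 0 or p = 1.
```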

Joint Entropy

Consider two random variables $X$ and $Y$, jointly distributed according to the pmf $p(x,y)$. The joint entropy is defined as

$$H(X,Y) = -\displaystyle\sum_{x_i, y_j} p(x_i, y_j) \log p(x_i, y_j)$$

The joint entropy measures how much uncertainty there is in the two random variables $X$ and $Y$ taken together.
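
As an illustration, the sketch below evaluates the joint entropy for two made-up joint pmfs: one where $X$ and $Y$ are independent fair bits, and one where they are perfectly correlated.

```python
import math

def joint_entropy(joint_pmf):
    """H(X,Y) = -sum_{x_i, y_j} p(x_i, y_j) * log2 p(x_i, y_j)."""
    return -sum(p * math.log2(p) for row in joint_pmf for p in row if p > 0)

# Two independent fair bits: 2 bits of uncertainty in total.
independent = [[0.25, 0.25],
               [0.25, 0.25]]
print(joint_entropy(independent))  # 2.0

# Perfectly correlated bits (X = Y): only 1 bit of uncertainty in total,
# since knowing X removes all uncertainty about Y.
correlated = [[0.5, 0.0],
              [0.0, 0.5]]
print(joint_entropy(correlated))  # 1.0
```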

Cross Entropy

The cross entropy between two probability distributions $p$ and $q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set when the coding scheme is optimized for an estimated probability distribution $q$ rather than the true distribution $p$.

The cross entropy for the distributions $p$ and $q$ over a given set is defined as

$$H_c(p,q) = \textrm{E}_p[-\log q] \pod{\text{2}}$$

For discrete $p$ and $q$, $H_c(p,q)$ is expressed as

$$H_c(p,q) = -\displaystyle\sum_{x_i} p(x_i) \log q(x_i) \pod{\text{3}}$$

The continuous version is

$$H_c(p,q) = -\displaystyle\int_X p(x) \log q(x) \, dx \pod{\text{4}}$$
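
Equation (3) translates directly into a short computation. The sketch below uses made-up distributions $p$ and $q$ and a hypothetical `cross_entropy` helper:

```python
import math

def cross_entropy(p, q):
    """H_c(p, q) = -sum_i p(x_i) * log2 q(x_i), equation (3)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # true distribution
q = [1/3, 1/3, 1/3]     # model (estimated) distribution

print(cross_entropy(p, p))  # 1.5 bits, which equals H(p)
print(cross_entropy(p, q))  # ~1.585 bits: coding with q is more expensive
```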

Equations (3) and (4) can be decomposed as

$$H_c(p,q) = \displaystyle\sum_{x_i} p(x_i) \log \frac{1}{p(x_i)} + \sum_{x_i} p(x_i) \log \frac{p(x_i)}{q(x_i)} = H(p) + D_{\textrm{KL}}(p\|q) \pod{\text{5}}$$

and

$$H_c(p,q) = \displaystyle\int_X p(x) \log \frac{1}{p(x)} \, dx + \int_X p(x) \log \frac{p(x)}{q(x)} \, dx = H(p) + D_{\textrm{KL}}(p\|q) \pod{\text{6}}$$

where $D_{\textrm{KL}}(p\|q) = \displaystyle\sum_{x_i} p(x_i) \log \frac{p(x_i)}{q(x_i)}$ and $D_{\textrm{KL}}(p\|q) = \displaystyle\int_X p(x) \log \frac{p(x)}{q(x)} \, dx$ are the Kullback-Leibler divergence of $q$ from $p$ for the discrete and continuous cases, respectively.

The Kullback-Leibler divergence is always non-negative, $D_{\textrm{KL}}(p\|q) \geq 0$, from Gibbs' inequality.

$D_{\textrm{KL}}(p\|q) = 0$ holds if and only if $p = q$.

From (5) and (6), the minimum value of the cross entropy $H_c(p,q)$ is $H(p)$, attained when the distributions $p$ and $q$ are identical.
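
The decomposition in (5) and the minimum at $q = p$ can be verified numerically; the helpers below are small illustrative re-implementations of $H(p)$, $D_{\textrm{KL}}(p\|q)$, and $H_c(p,q)$.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p(x_i) * log2(p(x_i) / q(x_i))."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]

# Equation (5): H_c(p, q) = H(p) + D_KL(p || q).
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))  # both 1.75

# The minimum over q is H(p), attained at q = p, where D_KL(p || q) = 0.
print(cross_entropy(p, p), entropy(p))  # both 1.5
```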

Application in Language Models

In language modeling, $p$ is taken to be the true distribution of words in the training data set $T$.

$q$ is the distribution of words as predicted by the model.

The cross entropy is measured on this training data set to assess how accurately the model predicts the training data. In this case, an estimate of the cross entropy is calculated as

$$H(T,q) = -\displaystyle\sum_{k=1}^{K} \frac{1}{K} \log_2 q(x_k)$$

where $K$ is the size of the training set and $x_k$ is the $k$-th word in $T$. This is a Monte Carlo estimate of the cross entropy.
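
As a toy illustration of this estimate (the vocabulary, the training sequence, and the unigram-style probabilities $q$ are all made up; a real language model would condition each prediction on context):

```python
import math

# Toy "model": predicted probability q(x) for each word in a tiny vocabulary.
q = {"the": 0.4, "cat": 0.2, "sat": 0.2, "mat": 0.1, "on": 0.1}

# Toy training set T, a sequence of K observed words.
T = ["the", "cat", "sat", "on", "the", "mat"]
K = len(T)

# H(T, q) = -(1/K) * sum_k log2 q(x_k): the average number of bits per word
# the model needs to encode the training data.
h_T_q = -sum(math.log2(q[word]) for word in T) / K
print(f"{h_T_q:.3f} bits per word")  # ~2.322
```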

