Entropy

MM 0412/2018


Imagine two people, Alice and Bob, living in Toronto and Boston, respectively.

Alice (Toronto) goes jogging whenever it is not snowing heavily.

Bob (Boston) doesn't ever go jogging.

Alice's actions give information about the weather in Toronto.

Bob's actions give no information. This is because Alice's actions are correlated with the weather in Toronto, whereas Bob's actions are deterministic.

How can we quantify the notion of information?

The entropy of a discrete random variable $X$ with pmf $p_X(x)$ is

$$H(X) = -\displaystyle\sum_{x_i} p(x_i) \log p(x_i) = -\textrm{E}\left[ \log p(X) \right] \pod{\text{1}}$$


The entropy measures the expected uncertainty in $X$.

$H(X)$ also indicates, roughly, how much information we learn on average from one instance of the random variable $X$.
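
As a quick sanity check of (1), here is a minimal Python sketch; the function name `entropy`, the base argument, and the example pmfs are illustrative choices, not part of the original notes.

```python
import math

def entropy(pmf, base=2):
    """H(X) = -sum_i p(x_i) * log p(x_i), equation (1)."""
    # Terms with p(x_i) = 0 contribute nothing, by the convention 0 * log 0 = 0.
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# A fair four-sided die: every outcome is equally surprising, so H(X) = log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0

# A heavily skewed distribution carries much less information per draw.
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.24 bits
```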

Changing the base of the logarithm only changes the value of the entropy by a multiplicative constant. For example,

$$H_{\textrm{bit}}(X) = -\displaystyle\sum_{i} p(x_i) \log_2 p(x_i) = \log_2(10) \left[ -\sum_{i} p(x_i) \log_{10} p(x_i) \right] = \log_2 10 \times H(X)$$

Base 2 is customarily used for the calculation of entropy, in which case the entropy is measured in bits.
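
A quick numerical check of this base-change relation, using an arbitrary illustrative pmf:

```python
import math

pmf = [0.5, 0.2, 0.2, 0.1]  # an arbitrary pmf for illustration

h_base10 = -sum(p * math.log10(p) for p in pmf)  # entropy with base-10 logs
h_base2 = -sum(p * math.log2(p) for p in pmf)    # entropy in bits

# Changing the base only rescales the entropy: H_bit(X) = log2(10) * H(X).
print(h_base2, math.log2(10) * h_base10)  # both ~1.76
```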

Example

Consider a random variable $X$ defined as

$$X = \begin{cases} 0, & \text{with probability } p \\ 1, & \text{with probability } 1-p. \end{cases}$$

The entropy of $X$ is given as

$$H(X) = -p \log p - (1-p) \log(1-p)$$

Note that the entropy depends only on the probability distribution $p$, not on the particular values that $X$ takes.
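
The binary entropy above is easy to tabulate; the sketch below (function name and sample points chosen for illustration) shows that it peaks at $p = 0.5$ and vanishes when the outcome is certain.

```python
import math

def binary_entropy(p):
    """H(X) = -p*log2(p) - (1-p)*log2(1-p), with H = 0 when p is 0 or 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p = {p:.1f}  H(X) = {binary_entropy(p):.3f} bits")
# Maximal uncertainty (1 bit) at p = 0.5; no uncertainty at p = 0 or p = 1.
```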

Joint Entropy

Consider two random variables $X$ and $Y$, jointly distributed according to the pmf $p(x,y)$. The joint entropy is defined as

$$H(X,Y) = -\displaystyle\sum_{x_i, y_j} p(x_i, y_j) \log p(x_i, y_j)$$

The joint entropy measures how much uncertainty there is in the two random variables $X$ and $Y$ taken together.
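
As an illustration, the sketch below evaluates the joint entropy for two made-up joint pmfs: one where $X$ and $Y$ are independent fair bits, and one where they are perfectly correlated.

```python
import math

def joint_entropy(joint_pmf):
    """H(X,Y) = -sum_{x_i, y_j} p(x_i, y_j) * log2 p(x_i, y_j)."""
    return -sum(p * math.log2(p) for row in joint_pmf for p in row if p > 0)

# Two independent fair bits: 2 bits of uncertainty in total.
independent = [[0.25, 0.25],
               [0.25, 0.25]]
print(joint_entropy(independent))  # 2.0

# Perfectly correlated bits (X = Y): only 1 bit of uncertainty in total,
# since knowing X removes all uncertainty about Y.
correlated = [[0.5, 0.0],
              [0.0, 0.5]]
print(joint_entropy(correlated))  # 1.0
```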

Cross Entropy

The cross entropy between two probability distributions $p$ and $q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set when the coding scheme is optimized for an estimated probability distribution $q$ rather than the true distribution $p$.

The cross entropy for the distributions $p$ and $q$ over a given set is defined as

$$H_c(p,q) = \textrm{E}_p[-\log q] \pod{\text{2}}$$

For discrete $p$ and $q$, $H_c(p,q)$ is expressed as

$$H_c(p,q) = -\displaystyle\sum_{x_i} p(x_i) \log q(x_i) \pod{\text{3}}$$

The continuous version is

$$H_c(p,q) = -\displaystyle\int_X p(x) \log q(x) \, dx \pod{\text{4}}$$
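
Equation (3) translates directly into a short computation. The sketch below uses made-up distributions $p$ and $q$ and a hypothetical `cross_entropy` helper:

```python
import math

def cross_entropy(p, q):
    """H_c(p, q) = -sum_i p(x_i) * log2 q(x_i), equation (3)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # true distribution
q = [1/3, 1/3, 1/3]     # model (estimated) distribution

print(cross_entropy(p, p))  # 1.5 bits, which equals H(p)
print(cross_entropy(p, q))  # ~1.585 bits: coding with q is more expensive
```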

Equations (3) and (4) can be decomposed as

$$H_c(p,q) = \displaystyle\sum_{x_i} p(x_i) \log \frac{1}{p(x_i)} + \sum_{x_i} p(x_i) \log \frac{p(x_i)}{q(x_i)} = H(p) + D_{\textrm{KL}}(p\|q) \pod{\text{5}}$$

and

$$H_c(p,q) = \displaystyle\int_X p(x) \log \frac{1}{p(x)} \, dx + \int_X p(x) \log \frac{p(x)}{q(x)} \, dx = H(p) + D_{\textrm{KL}}(p\|q) \pod{\text{6}}$$

where $D_{\textrm{KL}}(p\|q) = \displaystyle\sum_{x_i} p(x_i) \log \frac{p(x_i)}{q(x_i)}$ and $D_{\textrm{KL}}(p\|q) = \displaystyle\int_X p(x) \log \frac{p(x)}{q(x)} \, dx$ are the Kullback-Leibler divergence of $q$ from $p$ for the discrete and continuous cases, respectively.

The Kullback-Leibler divergence is always non-negative, $D_{\textrm{KL}}(p\|q) \geq 0$, from Gibbs' inequality.

$D_{\textrm{KL}}(p\|q) = 0$ holds if and only if $p = q$.

From (5) and (6), the minimum value of the cross entropy $H_c(p,q)$ is $H(p)$, attained when the distributions $p$ and $q$ are identical.
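
The decomposition in (5) and the minimum at $q = p$ can be verified numerically; the helpers below are small illustrative re-implementations of $H(p)$, $D_{\textrm{KL}}(p\|q)$, and $H_c(p,q)$.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p(x_i) * log2(p(x_i) / q(x_i))."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]

# Equation (5): H_c(p, q) = H(p) + D_KL(p || q).
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))  # both 1.75

# The minimum over q is H(p), attained at q = p, where D_KL(p || q) = 0.
print(cross_entropy(p, p), entropy(p))  # both 1.5
```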

Application in Language Models

In language modeling, $p$ is taken to be the true distribution of words in the training data set $T$.

$q$ is the distribution of words as predicted by the model.

The cross entropy is measured on this training data set to assess how accurately the model predicts the training data. In this case, an estimate of the cross entropy is calculated as

$$H(T,q) = -\displaystyle\sum_{k=1}^{K} \frac{1}{K} \log_2 q(x_k)$$

where $K$ is the size of the training set and $x_k$ is the $k$-th word in $T$. This is a Monte Carlo estimate of the cross entropy.
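
As a toy illustration of this estimate (the vocabulary, the training sequence, and the unigram-style probabilities $q$ are all made up; a real language model would condition each prediction on context):

```python
import math

# Toy "model": predicted probability q(x) for each word in a tiny vocabulary.
q = {"the": 0.4, "cat": 0.2, "sat": 0.2, "mat": 0.1, "on": 0.1}

# Toy training set T, a sequence of K observed words.
T = ["the", "cat", "sat", "on", "the", "mat"]
K = len(T)

# H(T, q) = -(1/K) * sum_k log2 q(x_k): the average number of bits per word
# the model needs to encode the training data.
h_T_q = -sum(math.log2(q[word]) for word in T) / K
print(f"{h_T_q:.3f} bits per word")  # ~2.322
```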

