Why a probability density function can be seen as the density of a measure with respect to another measure – the example of N(0,1)

This was originally written on Nov 3, 2013, for the probability theory course for which I was serving as TA.

Converted from .tex using latex2wp.

Usually, we say a random variable {X} follows a Normal(0,1) distribution if its cumulative distribution function can be expressed as:

\displaystyle P\{X\leq t\}=\int_{-\infty}^{t}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}dx.

Now we formalize this in a more measure-theoretic way, in line with what we learned in the course. In particular, why is {\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}} called the density of {X}? How is the term “probability density function”, which we use a lot in statistics, related to the concept of “density” (density of a measure with respect to another measure) that we learned in class?

First of all, we need to adopt a definition of a Normal(0,1) random variable. Say {X} is a random variable (i.e. a measurable function) from {(\Omega,\mathscr{F})} to {(\mathbb{R},\mathscr{R})}. Denote by {P} a probability measure on {(\Omega,\mathscr{F})} and by {\mu} the Lebesgue measure on {(\mathbb{R},\mathscr{R})}. We say {X} is a Normal(0,1) random variable if, for every {t\in\mathbb{R}}, we have (this is the definition we adopt, i.e. the starting point for the following arguments)

\displaystyle P\{\omega:X(\omega)\leq t\}=\int_{(-\infty,t]}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}d\mu(x). \ \ \ \ \ (1)

Now, how do we convert this into a statement that {\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}} is the density of some measure with respect to some other measure? Note that when we say a function {D} is the density of a measure {\rho} with respect to another measure {\mu}, we mean that {\rho(A)=\int_{A}D\,d\mu} for every measurable set {A}, so {\rho} and {\mu} need to be defined on the same measurable space. Since {P} lives on {(\Omega,\mathscr{F})} while {\mu} lives on {(\mathbb{R},\mathscr{R})}, at this point we cannot say {\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}} is the density of {P} with respect to {\mu}.

But now the distribution comes to the rescue. Recall that earlier in the class we learned the concept of the “distribution” of a random variable, which is a measure {L_{X}} on the target space (here {(\mathbb{R},\mathscr{R})}) defined as follows: for any {A\in\mathscr{R}},

\displaystyle L_{X}(A):=PX^{-1}(A)=P\{\omega:X(\omega)\in A\}. \ \ \ \ \ (2)

So, taking {A=(-\infty,t]} in (2), by (1) we have

\displaystyle L_{X}((-\infty,t])=\int_{(-\infty,t]}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}d\mu(x), \ \ \ \ \ (3)

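or, more generally (both sides of (3), viewed as functions of the set, are probability measures on {\mathscr{R}} that agree on the half-infinite intervals {(-\infty,t]}; these intervals form a {\pi}-system generating {\mathscr{R}}, so by the uniqueness theorem for measures the two sides agree on all of {\mathscr{R}}),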

\displaystyle L_{X}(A)=\int_{A}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}d\mu(x), \ \ \ \ \ (4)

for any {A\in\mathscr{R}}.

That is to say, {\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}} is the density of {L_{X}} (distribution of {X}, which is a probability measure) with respect to {\mu} (the Lebesgue measure on the real line).
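As a numerical sanity check of (4) for an interval {A=(a,b]}, here is a minimal Python sketch. It is only an illustration under our own assumptions: the function names, the endpoints {a=-1}, {b=2}, the midpoint rule and the sample size are all choices made here, and `random.gauss` stands in for a random variable {X} satisfying definition (1). It compares three quantities that should agree up to numerical and sampling error: the integral of the density over {A}, the difference of the cumulative distribution function at the endpoints, and the empirical frequency of {\{X\in A\}}, which estimates {L_{X}(A)}.

```python
import math
import random

def normal_density(x):
    """The candidate density of L_X with respect to Lebesgue measure."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def integrate_density(a, b, n=10_000):
    """Approximate the integral of the density over (a, b] by the midpoint rule."""
    h = (b - a) / n
    return h * sum(normal_density(a + (k + 0.5) * h) for k in range(n))

def normal_cdf(t):
    """P{X <= t} for a Normal(0,1) variable, written with the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def empirical_law(a, b, n_samples=200_000, seed=0):
    """Estimate L_X((a, b]) = P{omega : X(omega) in (a, b]} by simulation."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if a < rng.gauss(0.0, 1.0) <= b)
    return hits / n_samples

a, b = -1.0, 2.0  # illustrative endpoints, chosen arbitrarily
by_density = integrate_density(a, b)
by_cdf = normal_cdf(b) - normal_cdf(a)
by_simulation = empirical_law(a, b)
print(by_density, by_cdf, by_simulation)
```

The first two numbers agree to many digits, while the simulated one fluctuates with the seed and sample size.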

Illustration of Conditional Expectation: a random variable smoothed on a sigma field

This was originally written on Nov 25, 2013, for the probability theory course for which I was serving as TA.

In this note, we explain why the conditional expectation of a random variable {Y} given a {\sigma}-field {\mathscr{G}} can be seen as a “smoothed version of {Y} over {\mathscr{G}}” (in Example 2), and we briefly relate the definition of conditional expectation to the elementary {E\left(Y\mid X=x\right)} notation.

1. Preliminaries

Let {(\Omega,\mathscr{F})} and {(\Omega',\mathscr{F}')} be two measurable spaces (that is to say, {\mathscr{F}} is a {\sigma}-field on {\Omega}, and {\mathscr{F}'} is a {\sigma}-field on {\Omega'}), and let {X} be a mapping from {\Omega} to {\Omega'}. We say {X} is {\mathscr{F}/\mathscr{F}'}-measurable if {X^{-1}(B):=\left\{ \omega\in\Omega:X\left(\omega\right)\in B\right\} \in\mathscr{F}} for all {B\in\mathscr{F}'}. Usually, when there is no confusion, we abbreviate “{X} is {\mathscr{F}/\mathscr{F}'}-measurable” as “{X} is {\mathscr{F}}-measurable”. Another convention is that, when people write {X:\left(\Omega,\mathscr{F}\right)\rightarrow(\Omega',\mathscr{F}')}, they often imply that {X} is {\mathscr{F}/\mathscr{F}'}-measurable (and we’ll adopt this convention in this note).

Example 1 Say {X:\left(\Omega,\mathscr{F}\right)\rightarrow\left(\mathbb{R},\mathscr{B}\left(\mathbb{R}\right)\right)}, i.e. assume {X} is {\mathscr{F}/\mathscr{B}\left(\mathbb{R}\right)}-measurable, where {\mathscr{B}\left(\mathbb{R}\right)} denotes the Borel {\sigma}-field on the real line. Consider the following figure, where the large rectangle represents {\Omega} and the small grid cells generate {\mathscr{F}}, so in this case the {\sigma}-field {\mathscr{F}} consists of all the small cells and all unions of them. Say {A} is one of the “finest” elements of {\mathscr{F}}, indicated in red (“finest” here means that no nonempty proper subset of {A} is also in {\mathscr{F}}). Then {X} must be constant on {A}, i.e. there exists some {c\in\mathbb{R}} such that {X(\omega)=c\;\forall\omega\in A}. This is because, if {X} could take two values on {A}, say {X\left(\omega_{1}\right)=c_{1}} and {X\left(\omega_{2}\right)=c_{2}} for some {\omega_{1},\omega_{2}\in A} where {c_{1}\neq c_{2}}, then {X^{-1}\left(\left\{ c_{1}\right\} \right)} would contain {\omega_{1}} but not {\omega_{2}}: it would consist of a proper part of {A}, possibly along with some other part of {\Omega} outside {A}, as illustrated in blue. Such a set cannot be in {\mathscr{F}} (recall that {\mathscr{F}} consists of all the small cells and all unions of them), which, since {\left\{ c_{1}\right\} \in\mathscr{B}\left(\mathbb{R}\right)}, violates the assumption that {X} is {\mathscr{F}/\mathscr{B}\left(\mathbb{R}\right)}-measurable. Note that this observation also holds when the target space is a general {(\Omega',\mathscr{F}')} in which singletons are measurable, i.e., when {X:\left(\Omega,\mathscr{F}\right)\rightarrow(\Omega',\mathscr{F}')}: because {X} is {\mathscr{F}}-measurable, {X} must be constant on each “finest” piece of {\mathscr{F}}.


In particular, take {\mathscr{F}} to be {\sigma\left(X\right)} (recall that {\sigma\left(X\right)} is defined as the smallest {\sigma}-field with respect to which {X} is measurable). Since {X} is always {\sigma\left(X\right)}-measurable, we know that, roughly speaking, {X} is constant on each “finest” piece of {\sigma\left(X\right)}. To me, this is why people always say that {\sigma\left(X\right)} “contains the information” of {X}: by knowing how “fine” {\sigma\left(X\right)} is, one knows how “complicated” {X} is, i.e., how many different values {X} could take.

However, do note that this observation is only useful when {\sigma\left(X\right)} or {\mathscr{F}} is “discrete” in the sense that one can identify the “finest” cells, as in the above example. Otherwise, if say {\left(\Omega,\mathscr{F}\right)=\left(\mathbb{R},\mathscr{B}\left(\mathbb{R}\right)\right)}, then we cannot find a useful “finest” piece of {\mathscr{F}} – you can say a single number in {\mathbb{R}} is a “finest” piece of {\mathscr{B}\left(\mathbb{R}\right)}, but this won’t be useful, because any {X} is trivially constant on a single point of the sample space.
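In the “discrete” case, the observations of this section can be checked mechanically. The following is a minimal Python sketch on a hypothetical six-point sample space (the space, the partition and the two functions `X_ok` and `X_bad` are made up for illustration and do not come from the figure). It tests measurability by checking that a function is constant on each “finest” piece, and it recovers the “finest” pieces of {\sigma\left(X\right)} as the level sets of {X}.

```python
def is_measurable(X, partition):
    """Check that X is constant on each "finest" piece.

    X is a dict omega -> value, and partition is a list of disjoint sets
    (the "finest" pieces) generating the sigma-field on a finite Omega.
    """
    return all(len({X[w] for w in cell}) == 1 for cell in partition)

def atoms_of_sigma_X(X):
    """The "finest" pieces of sigma(X): the level sets {omega : X(omega) = c}."""
    levels = {}
    for w, value in X.items():
        levels.setdefault(value, set()).add(w)
    return sorted(levels.values(), key=min)

# Hypothetical example: Omega = {0, ..., 5}, F generated by three cells.
partition = [{0, 1}, {2, 3}, {4, 5}]
X_ok = {0: 1.0, 1: 1.0, 2: 5.0, 3: 5.0, 4: 1.0, 5: 1.0}   # constant on each cell
X_bad = {0: 1.0, 1: 2.0, 2: 5.0, 3: 5.0, 4: 1.0, 5: 1.0}  # two values on {0, 1}

print(is_measurable(X_ok, partition))
print(is_measurable(X_bad, partition))
print(atoms_of_sigma_X(X_ok))
```

Note that the “finest” pieces of {\sigma\left(X\right)} for `X_ok` are coarser than the cells of the partition: `X_ok` takes only two distinct values, so {\sigma\left(X\right)} only needs two pieces.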

2. Illustration of conditional expectation

(Billingsley, Section 34, P445) Suppose {Y} is an integrable random variable on {\left(\Omega,\mathscr{F},P\right)}, and that {\mathscr{G}} is a sub {\sigma}-field of {\mathscr{F}} (i.e. {\mathscr{G}\subset\mathscr{F}}). Then there exists a function {Z:\Omega\rightarrow\mathbb{R}}, called the conditional expectation of {Y} given {\mathscr{G}} and denoted by {E\left(Y\mid\mathscr{G}\right)}, such that {Z} has the following two properties:

  1. {Z} is {\mathscr{G}}-measurable and integrable;
  2. {Z} satisfies the following equation:

    \displaystyle \int_{A}Z\, dP=\int_{A}Y\, dP,\quad\forall A\in\mathscr{G}. \ \ \ \ \ (1)

One can show that the conditional expectation is a.s. unique, i.e., if {Z} and {W} both satisfy the above two conditions, then {Z=W} a.s.
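The uniqueness claim can be sketched as follows (this is a standard argument, given here only as an outline and not necessarily the way it was presented in the course). Suppose {Z} and {W} both satisfy the two properties, and consider the set where {Z} exceeds {W}, which is in {\mathscr{G}} because both functions are {\mathscr{G}}-measurable:

```latex
A := \{\omega : Z(\omega) > W(\omega)\} \in \mathscr{G},
\qquad
\int_{A} (Z - W)\, dP = \int_{A} Y\, dP - \int_{A} Y\, dP = 0 .
```

Since {Z-W>0} everywhere on {A}, the integral can only vanish if {P(A)=0}. Swapping the roles of {Z} and {W} gives {P\{W>Z\}=0} as well, hence {Z=W} a.s.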

In the following example, we illustrate how the conditional expectation {E\left(Y\mid\mathscr{G}\right)} can be seen as the “smoothed version of {Y} over {\mathscr{G}}”.

Example 2 Suppose {Y} is an integrable random variable on {\left(\Omega,\mathscr{F},P\right)}, and that {\mathscr{G}} is a sub {\sigma}-field of {\mathscr{F}}. The following two figures represent {\mathscr{F}} and {\mathscr{G}}, respectively. So {\mathscr{F}} contains all the 24 rectangles with grey edges and all unions of them, while {\mathscr{G}} contains all the 6 rectangles with red edges and all unions of them. Note that the cells of {\mathscr{F}} form a “finer” partition of {\Omega} than the cells of {\mathscr{G}}.


Now, consider the 4 grey cells {A_{1},A_{2},A_{3},A_{4}} in the figure, and call their union {A=\bigcup_{i=1}^{4}A_{i}}. Note that {A\in\mathscr{G}}, and {A} can be seen as a “finest” piece of {\mathscr{G}}. Denote {Z:=E\left(Y\mid\mathscr{G}\right)}. By definition {Z} is {\mathscr{G}}-measurable, so {Z} needs to be constant on {A} (as explained in Example 1). Also, by definition, {Z} needs to satisfy the equation:

\displaystyle \int_{A}Z\, dP=\int_{A}Y\, dP=\sum_{i=1}^{4}\int_{A_{i}}Y\, dP. \ \ \ \ \ (2)

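Using the facts that {Z} is constant on {A} and that {Y} is constant on each {A_{i}} (since {Y} is {\mathscr{F}}-measurable and each {A_{i}} is a “finest” piece of {\mathscr{F}}), we have (with a slight abuse of notation, denote the constant value of {Z} on {A} by {Z(A)}, and similarly for {Y})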

\displaystyle Z\left(A\right)P\left(A\right)=\sum_{i=1}^{4}Y\left(A_{i}\right)P\left(A_{i}\right), \ \ \ \ \ (3)

and hence, provided {P\left(A\right)>0},

\displaystyle Z\left(A\right)=\frac{1}{P\left(A\right)}\sum_{i=1}^{4}P\left(A_{i}\right)Y\left(A_{i}\right)=\sum_{i=1}^{4}\frac{P\left(A_{i}\right)}{P\left(A\right)}Y\left(A_{i}\right). \ \ \ \ \ (4)

This means {Z(A)} is the average of the 4 values that {Y} takes on the {A_{i}}’s, with weights proportional to the probabilities of the {A_{i}}’s. This observation extends to the other cells of the grid as well. To conclude, we find from this example that on each cell of {\mathscr{G}}, {E\left(Y\mid\mathscr{G}\right)} is constant, and the constant value equals the weighted average of the values {Y} takes within this cell. With this observation, we can say that {E\left(Y\mid\mathscr{G}\right)} is a “weighted average version of {Y} over {\mathscr{G}}”.
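The weighted-average formula can be tried out on a small finite model. The following Python sketch uses a hypothetical setup that is not taken from the figures: 8 “finest” cells of {\mathscr{F}} with made-up probabilities and values of {Y}, grouped into 2 cells of {\mathscr{G}}. It computes {Z} cell by cell as the probability-weighted average of {Y}, and then checks the defining equation {\int_{A}Z\,dP=\int_{A}Y\,dP} for every {A\in\mathscr{G}}. Exact fractions are used so that the equalities can be tested without rounding error.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite model: 8 "finest" cells of F, labelled 0..7,
# with made-up probabilities (weights / 20) and values of Y.
weights = [1, 2, 3, 4, 1, 4, 2, 3]
P = {i: Fraction(w, sum(weights)) for i, w in enumerate(weights)}
Y = {0: 10, 1: 20, 2: 30, 3: 40, 4: 5, 5: 5, 6: 0, 7: 8}
# The "finest" cells of G: each is a union of 4 cells of F.
G_cells = [{0, 1, 2, 3}, {4, 5, 6, 7}]

def integral(f, A):
    """Integral of f over A with respect to P: sum of f * P over the cells in A."""
    return sum(f[i] * P[i] for i in A)

def conditional_expectation(Y, G_cells):
    """Z = E(Y | G): on each cell A of G, the P-weighted average of Y."""
    Z = {}
    for A in G_cells:
        z = integral(Y, A) / sum(P[i] for i in A)  # assumes P(A) > 0
        for i in A:
            Z[i] = z
    return Z

Z = conditional_expectation(Y, G_cells)

# Check the defining equation for every A in G (all unions of cells of G,
# including the empty union).
all_G_sets = [set().union(*cells)
              for r in range(len(G_cells) + 1)
              for cells in combinations(G_cells, r)]
defining_equation_holds = all(integral(Z, A) == integral(Y, A) for A in all_G_sets)
print(Z[0], Z[4], defining_equation_holds)
```

In this toy model {Z} equals 30 on the first cell of {\mathscr{G}} and 49/10 on the second, and the defining equation holds on all four sets of {\mathscr{G}}.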

More generally, if {\mathscr{G}} is a more complicated {\sigma}-field than the one in this example, then instead of calling {E\left(Y\mid\mathscr{G}\right)} a “weighted average version of {Y} over {\mathscr{G}}”, we can call it a “smoothed version of {Y} over {\mathscr{G}}”. In other words, {E\left(Y\mid\mathscr{G}\right)} tries its best to perform just like {Y}, under the constraint that {E\left(Y\mid\mathscr{G}\right)} needs to be {\mathscr{G}}-measurable.

One can also replace {\mathscr{G}} with {\sigma\left(X\right)} for some {X}, assuming {\sigma\left(X\right)\subset\mathscr{F}}, and the same reasoning applies.

3. Relation to conditional expectation as defined in introductory probability courses

How is the above definition of conditional expectation related to the definition in introductory probability courses, i.e. {E\left(Y\mid X=x\right)}?

Given {\left(\Omega,\mathscr{F},P\right)}, let {Y:\left(\Omega,\mathscr{F}\right)\rightarrow\left(\mathbb{R},\mathscr{B}\left(\mathbb{R}\right)\right)}, and {X:\left(\Omega,\sigma\left(X\right)\right)\rightarrow\left(\Omega',\mathscr{F}'\right)}, with {\sigma\left(X\right)\subset\mathscr{F}}. For the usual {E\left(Y|X=x\right)}, consider it as a function of {x}, and denote it by {g\left(x\right)}, i.e., {g:\Omega'\rightarrow\mathbb{R},g\left(x\right)=E\left(Y|X=x\right)}. Now let {h:=g\circ X:\Omega\rightarrow\mathbb{R},h\left(\omega\right)=g\left(X\left(\omega\right)\right)}, as in the following figure (figure adopted from R. Ash’s Probability and Measure Theory):


One can show that the random variable {h} here serves as the conditional expectation {E\left(Y\mid\sigma\left(X\right)\right)}. For a detailed proof, see R. Ash’s Probability and Measure Theory (2nd edition), Section 5.4, P215-216.
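In the discrete case, the claim about {h=g\circ X} can be verified directly. Here is a minimal Python sketch on a hypothetical six-point sample space with uniform {P} (the space and the values of {X} and {Y} are made up for illustration). It computes {g\left(x\right)} by the elementary formula {E\left(Y1_{\{X=x\}}\right)/P\left(X=x\right)}, forms {h=g\circ X}, and checks that {\int_{A}h\,dP=\int_{A}Y\,dP} for every {A\in\sigma\left(X\right)}. It is a check on one example, not a substitute for the proof in Ash.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite example: Omega = {0, ..., 5} with uniform P.
P = {w: Fraction(1, 6) for w in range(6)}
X = {0: "a", 1: "a", 2: "b", 3: "b", 4: "b", 5: "c"}  # X : Omega -> Omega'
Y = {0: 1, 1: 3, 2: 0, 3: 6, 4: 3, 5: 7}              # Y : Omega -> R

def g(x):
    """Elementary E(Y | X = x) = E(Y 1{X = x}) / P(X = x), assuming P(X = x) > 0."""
    event = [w for w in P if X[w] == x]
    return sum(Y[w] * P[w] for w in event) / sum(P[w] for w in event)

# h = g o X is a random variable on Omega, constant on each level set of X.
h = {w: g(X[w]) for w in P}

def integral(f, A):
    """Integral of f over A with respect to P."""
    return sum(f[w] * P[w] for w in A)

# On this finite space, every set in sigma(X) is X^{-1}(B) for some subset B
# of the values taken by X.
values = sorted(set(X.values()))
sigma_X = [{w for w in P if X[w] in B}
           for r in range(len(values) + 1)
           for B in combinations(values, r)]
h_is_cond_exp = all(integral(h, A) == integral(Y, A) for A in sigma_X)
print(h, h_is_cond_exp)
```

Since {h} is constant on each level set of {X}, it is {\sigma\left(X\right)}-measurable by construction, so together with the integral check it satisfies both defining properties of {E\left(Y\mid\sigma\left(X\right)\right)} in this example.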


  1. Billingsley, P. (2008). Probability and measure. John Wiley & Sons.
  2. Ash, R. B., & Doléans-Dade, C. A. (2000). Probability and measure theory (2nd ed.). Academic Press.