Bayes’ Theorem#

Starting from the two expressions for the product rule of Probability Theory:

(7)#\[ p(X,Y) = p(Y\vert X)p(X) \]
(8)#\[ p(X,Y) = p(X\vert Y)p(Y) \]

We see that there are two ways to express the joint distribution with different conditionals. From the discussion in Graph Models, you can see that each expression corresponds to a different graph. Try to draw the graphs in this case and see how the link relating \(X\) and \(Y\) is inverted when we change between the two expressions above.

Substituting one of the expressions above into the other we get:

(9)#\[ p(Y\vert X) = \frac{p(X\vert Y)p(Y)}{p(X)} \]

which is known as Bayes’ Theorem. This is a very important relation at the core of what makes machine learning tools work. In the context of probabilistic graph models, it can be seen as a way to invert links in a graph. The motivation for doing that will become clear in what follows.
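As a quick numerical sanity check, here is a minimal Python sketch (not from the text; the joint table is arbitrary) showing that the conditional obtained through Bayes’ Theorem agrees with the one read directly off the joint distribution:

```python
import numpy as np

# Hypothetical joint distribution p(X, Y) over two binary variables,
# indexed as joint[x, y]; the numbers are arbitrary but sum to one.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

p_x = joint.sum(axis=1)             # marginal p(X): sum over Y
p_y = joint.sum(axis=0)             # marginal p(Y): sum over X
p_x_given_y = joint / p_y           # p(X|Y) = p(X,Y) / p(Y)
p_y_given_x = joint / p_x[:, None]  # p(Y|X) = p(X,Y) / p(X)

# Bayes' Theorem: p(Y|X) = p(X|Y) p(Y) / p(X)
bayes = p_x_given_y * p_y / p_x[:, None]

print(np.allclose(bayes, p_y_given_x))  # True: both routes give the same conditional
```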

Probabilities as a measure of belief#

In Probability Theory we used probabilities to express the frequency with which an event occurs: in our case, how often a ball drawn at random from the bag would turn out to be, for instance, a blue metal ball. This is what we call the frequentist perspective on probabilities.

In the Bayesian perspective we use probabilities to express belief and common sense, and to consistently update them in the face of convincing evidence. In the Bayesian context, it is useful to interpret probabilities as:

  • High probability: Based on my current knowledge and expectations, this event is likely to happen even if I never observed it directly before;

  • Low probability: My current knowledge suggests that this event is unlikely. There is not enough evidence to the contrary at this point in time.

Going back to the theorem, we can read it as:

\[ \mathrm{posterior}\displaystyle\propto\mathrm{likelihood}\times\mathrm{prior} \]

which, given the interpretation of probabilities as beliefs, can be read as follows:

Interpretation of Bayes’ Theorem

\(p(Y\vert X) = \frac{p(X\vert Y)p(Y)}{p(X)}\)

How much should I adjust my prior belief \(p(Y)\) into a posterior one, \(p(Y\vert X)\), given that I just observed \(X\)? The answer balances the prior against how well \(Y\) would explain what I observed (\(p(X\vert Y)\)), tempered by how likely I was to observe \(X\) regardless of \(Y\) (\(p(X)\)).

Further Reading

Read Section 1.2.3 of [bishop-prml] for an extended discussion on this.

Example: A two-variable graph#

Let us start with an intuitive example. Consider a probabilistic model to predict whether a student passes an exam, depending only on whether they studied:

Fig. 6 A pass/fail model – Prior#

Recall that the joint distribution for this graph reads

(10)#\[ p(S,P) = p(S)p(P\vert S) \]

where we use single letters for the variables for convenience. Read Graph Models again in case this is unclear to you.

Let us make a common-sense assumption that about 60% of the students will study properly. This leads to a prior distribution for \(S\):

(11)#\[ p(S=1) = 0.6 \quad p(S=0) = 0.4 \]

This means that for a given student, before we obtain more information, our best guess is to assume a chance of about 60% that they studied.

Let us further assume that if a person studies we are 90% confident they will pass the exam, while even without studying there is a fair chance of 30% they will still pass due to luck/intuition. This leads to the conditional distributions:

(12)#\[ p(P=1\vert S=1) = 0.9 \quad p(P=1\vert S=0) = 0.3 \]

Because probability distributions must sum to one, it necessarily follows that:

(13)#\[ p(P=0\vert S=1) = 0.1 \quad p(P=0\vert S=0) = 0.7 \]
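To make the arithmetic below easy to follow, here is a minimal Python sketch of this model. The names p_S and p_P_given_S are just illustrative, not part of the text:

```python
# Prior over S (studied or not), eq. (11)
p_S = {1: 0.6, 0: 0.4}

# Conditional p(P | S), eqs. (12)-(13); keys are (p, s) pairs
p_P_given_S = {
    (1, 1): 0.9,  # p(P=1 | S=1)
    (1, 0): 0.3,  # p(P=1 | S=0)
    (0, 1): 0.1,  # p(P=0 | S=1)
    (0, 0): 0.7,  # p(P=0 | S=0)
}
```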

Now imagine a student just took the exam and after grading it we observe that they got a passing grade. Our graph then changes to:

Fig. 7 A pass/fail model – Observation#

We now have more information to work with. From our prior we were about 60% confident they studied, but now that can be updated with this new piece of information. This is where Bayes’ Theorem comes in:

(14)#\[ p(S=1\vert P=1) = \frac{p(P=1\vert S=1)p(S=1)}{p(P=1)} \]

where we are now looking at a posterior opinion about the student that should reflect our new observation. This effectively changes our graph to:

Fig. 8 A pass/fail model – Posterior#

To compute the posterior we first need the marginal \(p(P=1)\). Using the Sum Rule (Probability Theory), we get:

(15)#\[ p(P=1) = p(P=1\vert S=1)p(S=1) + p(P=1\vert S=0)p(S=0) = 0.9\cdot 0.6 + 0.3\cdot 0.4 = 0.66 \]
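Continuing the sketch above, the same marginal falls out of one line of code:

```python
# Sum rule: p(P=1) = sum_s p(P=1 | S=s) p(S=s)
p_P1 = sum(p_P_given_S[(1, s)] * p_S[s] for s in (0, 1))
print(p_P1)  # ~0.66
```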

And since \(p(P)\) must be a valid distribution, the marginal probability of failing the exam is:

(16)#\[ p(P=0) = 1 - p(P=1) = 0.34 \]

Finally, we use Bayes’ Theorem to obtain:

(17)#\[ p(S=1\vert P=1) = \frac{p(P=1\vert S=1)p(S=1)}{p(P=1)} = \frac{0.9\cdot 0.6}{0.66} = 0.818 \]
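Continuing the sketch, the posterior is reproduced by:

```python
# Bayes' Theorem: p(S=1 | P=1) = p(P=1 | S=1) p(S=1) / p(P=1)
posterior = p_P_given_S[(1, 1)] * p_S[1] / p_P1
print(posterior)  # ~0.818
```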

So our confidence that they studied went from 60% to about 82%! That makes sense: we should update our beliefs in the face of new evidence (a pass on the exam) while still taking into account the possibility that they could have passed anyway.

Further Reading

Section 8.2.1 of [bishop-prml] has another insightful example of Bayes’ Theorem in action, this time for a graph with three nodes. Start from the last paragraph of page 376 and read until just before Section 8.2.2. For extra context and more details on conditional independence, start reading from the beginning of Section 8.2.