It is computationally difficult to learn the joint probability distribution function of a set of random variables from observed data. Often an approximation of such a function is learned instead. In particular, the method of [Chow and Liu, 1968] approximates the nth-order distribution of n random variables with a second-order distribution (a graph) and further assumes that the topology of the binary relation is a rooted tree. Under this assumption, the best approximation can be obtained by computing the maximum spanning tree,
where the edge weight between two random variables $X_i$ and $X_j$ is their mutual information $I(X_i, X_j)$, computed as
$$I(X_i, X_j) = \sum_{a, b} P(a, b) \log \frac{P(a, b)}{q(a)\, q(b)}$$
where $P(a, b)$ is the joint probability that $X_i$ and $X_j$ take the value pair $(a, b)$, and $q(a)$ and $q(b)$ are the probabilities that $X_i$ takes value $a$ and that $X_j$ takes value $b$, respectively.
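As an illustration of the formula above, the following minimal sketch estimates the mutual information of two variables from paired discrete observations. It assumes that the probabilities are replaced by empirical frequencies and that the natural logarithm is used; the function name `mutual_information` and its arguments are hypothetical and do not come from the original method description.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Estimate I(X, Y) from paired discrete observations (assumed: empirical frequencies)."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each value pair (a, b)
    count_x = Counter(xs)         # counts of each value a of X
    count_y = Counter(ys)         # counts of each value b of Y
    total = 0.0
    for (a, b), k in joint.items():
        p_ab = k / n              # empirical P(a, b)
        q_a = count_x[a] / n      # empirical q(a)
        q_b = count_y[b] / n      # empirical q(b)
        total += p_ab * log(p_ab / (q_a * q_b))
    return total
```

Value pairs that never occur are simply absent from the sum, which matches the usual convention that a term with P(a, b) = 0 contributes nothing.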
Assume you are given a set of gene expression data that contains the following information:
- There are n genes;
- Each gene has m samples;
- There are c conditions;
- Each sample gives an expression level under each condition;
- Expression levels can be discretized to d values.
Suppose you would like to compute the second-order distribution of gene expressions using Chow and Liu's method, i.e., to compute a tree topology that relates these genes based on their gene expression data. Explain how to proceed with the computation using the five given assumptions and the data.
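One possible way to proceed is sketched below, under several assumptions that the problem statement does not fix. Each gene is treated as one random variable with m × c observations, one per (sample, condition) pair. Observation k is assumed to refer to the same sample and condition for every gene, so that joint frequencies are meaningful. Discretization is assumed to use equal-width bins per gene. The maximum spanning tree is built with Kruskal's algorithm. All names (`discretize`, `chow_liu_edges`, the `expr[g][s][k]` layout) are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def discretize(values, d):
    """Map values to integer levels 0..d-1 (assumed scheme: equal-width bins)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / d
    return [min(int((v - lo) / width), d - 1) for v in values]


def mutual_information(xs, ys):
    """Empirical mutual information of two aligned discrete sequences."""
    n = len(xs)
    joint, cx, cy = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # P(a,b) * log(P(a,b) / (q(a) q(b))), written in terms of raw counts
    return sum((k / n) * log(k * n / (cx[a] * cy[b]))
               for (a, b), k in joint.items())


def chow_liu_edges(expr, d):
    """expr[g][s][k]: expression of gene g, sample s, condition k.

    Returns the n - 1 edges (i, j, weight) of a maximum spanning tree.
    """
    # Flatten each gene to m * c observations, then discretize to d levels.
    obs = [discretize([v for sample in gene for v in sample], d)
           for gene in expr]
    n = len(obs)
    # Weight every gene pair by mutual information, heaviest first.
    weighted = sorted(((mutual_information(obs[i], obs[j]), i, j)
                       for i, j in combinations(range(n), 2)), reverse=True)
    parent = list(range(n))  # union-find forest for Kruskal's algorithm

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    tree = []
    for w, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding the edge does not create a cycle
            parent[ri] = rj
            tree.append((i, j, w))
    return tree
```

The returned edges are undirected. To obtain a rooted tree, any gene can be chosen as the root and the edges directed away from it. This sketch computes n(n - 1)/2 pairwise mutual information values, each from a d × d table of joint frequencies.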