CSCIx490 Homework #4 Spring 2016

CSCI 4490-6490 Algorithms for Computational Biology Spring 2016

Homework Assignment 5

Building a profile HMM for a family of sequences usually requires that a set of training sequences be multiply aligned, so that the maximum likelihood method can be used to estimate the probability parameters of the profile HMM. When such a multiple alignment is unavailable or inaccurate for a given set of training sequences, the probability parameters of the profile HMM may still be reasonably estimated. Explain how such a process can be extended to parameter estimation for general HMMs when only unstructured training data is available. To simplify your discussion, you may assume an HMM with begin state S_b and end state S_e. It is sufficient to explain how to estimate the transition probability p(S_i, S_j) and the emission probability q(S_i, ‘A’), where S_i and S_j are two states in the HMM and ‘A’ is a letter in the alphabet.
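For the aligned case mentioned above, where the state path of every training sequence is known, maximum likelihood estimation reduces to normalized counts. The following is a minimal sketch of that supervised case only; the state names (`M1`, `M2`), the path representation, and the function name `ml_estimate` are assumptions made for illustration, and the sketch does not address the unaligned case the question asks about.

```python
from collections import Counter

def ml_estimate(paths):
    """Count-based ML estimates from known state paths.

    paths: list of paths, each a list of (state, symbol) pairs;
           symbol is None for states that emit nothing (e.g. S_b, S_e).
    Returns (p, q) with p[(s, t)] = transition probability s -> t
    and q[(s, x)] = probability that state s emits symbol x.
    """
    trans, emit = Counter(), Counter()
    out_total, emit_total = Counter(), Counter()
    for path in paths:
        # Count each observed transition s -> t along the path.
        for (s, _), (t, _) in zip(path, path[1:]):
            trans[(s, t)] += 1
            out_total[s] += 1
        # Count each observed emission of x from state s.
        for s, x in path:
            if x is not None:
                emit[(s, x)] += 1
                emit_total[s] += 1
    p = {k: v / out_total[k[0]] for k, v in trans.items()}
    q = {k: v / emit_total[k[0]] for k, v in emit.items()}
    return p, q

# Hypothetical labeled training data: two sequences with known state paths.
paths = [
    [("S_b", None), ("M1", "A"), ("M2", "C"), ("S_e", None)],
    [("S_b", None), ("M1", "A"), ("S_e", None)],
]
p, q = ml_estimate(paths)
```

In practice, pseudocounts are often added to the counters so that unseen transitions and emissions do not receive probability zero.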
A stochastic context-free grammar (SCFG) generalizes the HMM so that (limited) coevolution of molecular residues can be modeled (e.g., by co-emitting a pair of residues a and b with rules of the form X → aYb). SCFG rules also permit the degenerate form X → aY, which models the emission of a single residue a. Applying such a rule can also be viewed as a state transition from X to Y. Argue that the SCFG is a more general model than the HMM by showing that rules like X → aY can replace the transitions and emissions needed in an HMM. Your argument may be given by showing specific SCFG rules for the following portion of a hypothetical HMM:
- State A has transition to state B with probability p;
- State A has transition to state C with probability 1-p;
- State B is a silent state, and
- State C emits 'a', 'c', 'g', and 't' with equal probability 1/4.
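To make the rule notation concrete, the sketch below samples strings from a small SCFG that mixes a pair-emitting rule of the form X → aYb with the single-residue form X → aY. The grammar, its probabilities, and the names `RULES` and `derive` are hypothetical and chosen only for illustration; it is not the set of rules requested for the HMM fragment above.

```python
import random

# Hypothetical toy grammar. Each nonterminal maps to a list of rules
# (probability, left residue, next nonterminal or None, right residue).
RULES = {
    "X": [
        (0.5, "a", "Y", ""),    # X -> aY  : emit 'a', then move to Y
        (0.5, "g", "X", "c"),   # X -> gXc : co-emit the pair (g, c) around X
    ],
    "Y": [
        (1.0, "t", None, ""),   # Y -> t   : emit 't' and stop
    ],
}

def derive(symbol, rng):
    """Expand `symbol` by sampling one rule per step; return the emitted string."""
    rules = RULES[symbol]
    weights = [rule[0] for rule in rules]
    _, left, nxt, right = rng.choices(rules, weights=weights, k=1)[0]
    middle = derive(nxt, rng) if nxt is not None else ""
    return left + middle + right

rng = random.Random(0)
samples = [derive("X", rng) for _ in range(5)]
```

Every string generated by this toy grammar consists of k copies of 'g', then "at", then k copies of 'c'. The matching counts on both sides come from the paired rule, while the rule X → aY behaves like an emission followed by a state transition.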
It is computationally difficult to learn the joint probability distribution function of a set of random variables from observed data. Often an approximation of such a function is learned instead. In particular, the method of [Chow and Liu, 1968] approximates the n-th order distribution of n random variables with a second order distribution (graph) and further assumes the topology of the binary relation to be a rooted tree. Under this assumption, the best approximation can be obtained by computing the maximum spanning tree, where the edge weight between two random variables X_i, X_j is the mutual information I(X_i, X_j), computed with

I(X_i, X_j) = \sum_{a, b} P(a, b) \log \frac{P(a, b)}{q(a) \, q(b)}

where P(a, b) is the joint probability that X_i and X_j take the value pair (a, b), and q(a) and q(b) are the probabilities that X_i takes value a and that X_j takes value b, respectively.
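The two ingredients described above, pairwise mutual information and a maximum spanning tree, can be sketched generically as follows. This is a minimal illustration under stated assumptions: probabilities are estimated as empirical frequencies, the spanning tree is built with Kruskal's algorithm (one of several valid choices), and the function names and toy data are hypothetical. How the gene expression data should be mapped onto the variables is left open here.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical I(X, Y) in nats from two equal-length lists of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

def chow_liu_edges(columns):
    """Edges (i, j) of a maximum spanning tree under pairwise mutual information.

    columns[i] holds the observed values of variable X_i.
    """
    k = len(columns)
    weighted = sorted(
        ((mutual_information(columns[i], columns[j]), i, j)
         for i, j in combinations(range(k), 2)),
        reverse=True,  # heaviest edges first, since the tree is a maximum one
    )
    parent = list(range(k))  # union-find structure for Kruskal's algorithm

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding this edge does not close a cycle
            parent[ri] = rj
            edges.append((i, j))
    return edges

# Hypothetical toy data: X_1 copies X_0, X_2 is independent of both.
x0 = [0, 0, 1, 1, 0, 0, 1, 1]
x1 = [0, 0, 1, 1, 0, 0, 1, 1]
x2 = [0, 1, 0, 1, 0, 1, 0, 1]
edges = chow_liu_edges([x0, x1, x2])
```

The result is an undirected tree; choosing any node as the root and directing edges away from it gives the rooted form.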
Assume you are given a set of gene expression data that contains the following information:
- There are n genes;
- Each gene has m samples;
- There are c conditions;
- Each sample gives an expression level under each condition;
- Expression levels can be discretized to d values.
Suppose you would like to compute the second order distribution of gene expressions using Chow and Liu's method, i.e., to compute a tree topology that relates these genes based on their gene expression data. Explain how to proceed with the computation using the 5 given assumptions and the data.

This homework is due on Thursday April 28, 2016.