CSCI 4490-6490 Algorithms for Computational Biology Spring 2016

Homework Assignment 5


  1. Building a profile HMM for a family of sequences usually requires that the training sequences be multiply aligned, so that the maximum likelihood method can be used to estimate the probability parameters of the profile HMM. When such a multiple alignment is unavailable or inaccurate for a given set of training sequences, the probability parameters of the profile HMM may still be reasonably estimated. Explain how such a process can be extended to parameter estimation for the construction of general HMMs when only unstructured training data is available. To simplify your discussion, you may assume an HMM with begin state Sb and end state Se. It suffices to explain how to estimate the transition probability p(Si, Sj) and the emission probability q(Si, ‘A’), where Si and Sj are two states in the HMM and ‘A’ is a letter in the alphabet.

  2. A stochastic context-free grammar (SCFG) is a generalization of the HMM that makes it possible to model (limited) coevolution of molecular residues (e.g., to co-emit a pair of residues a and b with rules of the form X → aYb). SCFG rules also permit the degenerate form X → aY, which models the emission of a single residue a. Applying such a rule can also be regarded as a state transition from X to Y. Argue that the SCFG is a more general model than the HMM by showing that rules like X → aY can replace the transitions and emissions needed in an HMM. Your argument may be given by showing specific SCFG rules for the following portion of a hypothetical HMM:

  3. It is computationally difficult to learn the joint probability distribution function of a set of random variables from observed data. Often an approximation of such a function is learned instead. In particular, the method of [Chow and Liu, 1968] approximates the nth-order distribution of n random variables with a second-order distribution (graph) and further assumes that the topology of the binary relation is a rooted tree. Under this assumption, the best approximation can be obtained by computing the maximum spanning tree, where the weight of the edge between two random variables Xi and Xj is the mutual information I(Xi, Xj), computed with

    I(X_i, X_j) = \sum_{a, b} P(a, b) \log \frac{P(a, b)}{q(a) \, q(b)}

    where P(a, b) is the joint probability that Xi and Xj take the value pair (a, b), and q(a) and q(b) are the probabilities that Xi takes value a and that Xj takes value b, respectively.

    Assume you are given a set of gene expression data that contains the following information:

    Suppose you would like to compute the second order distribution of gene expressions using Chow and Liu's method, i.e., to compute a tree topology that relates these genes based on their gene expression data. Explain how to proceed with the computation using the 5 given assumptions and the data.
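
For orientation, the two ingredients named in Problem 3 (pairwise mutual information and a maximum spanning tree) can be sketched as follows. This is a minimal illustration, not a reference solution. It assumes the expression values have already been discretized into a few levels, estimates probabilities by relative frequencies, uses the natural logarithm (the base does not change which tree is selected), and builds the tree with Kruskal's algorithm. The function names and the toy values for g1, g2, g3 are hypothetical and are not the assignment's data.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of value pairs (a, b)
    px = Counter(xs)               # counts of a for the first variable
    py = Counter(ys)               # counts of b for the second variable
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


def chow_liu_tree(columns):
    """Return the edges (u, v, weight) of a maximum spanning tree.

    `columns` maps a variable name to its list of discretized observations;
    all lists are assumed to have the same length.
    """
    names = list(columns)
    # Weight every pair of variables by its empirical mutual information,
    # then visit the pairs from heaviest to lightest (Kruskal's algorithm).
    edges = sorted(
        ((mutual_information(columns[u], columns[v]), u, v)
         for u, v in combinations(names, 2)),
        key=lambda edge: edge[0],
        reverse=True,
    )
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:       # keep the edge only if it joins two components
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree


# Hypothetical toy data: three genes, eight samples, two expression levels.
toy = {
    "g1": [0, 0, 0, 0, 1, 1, 1, 1],
    "g2": [0, 0, 0, 1, 1, 1, 1, 1],
    "g3": [0, 0, 1, 1, 1, 1, 1, 1],
}
toy_tree = chow_liu_tree(toy)
```

The result is an undirected tree; choosing any node as the root orients the edges without changing the approximation. With real data, ties between edge weights may make the tree non-unique.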

This homework is due on Thursday April 28, 2016.