Table 1: The three levels of analysis suggested by Marr and Poggio (reproduced from [Marr, 1982]). This thesis focuses on the computational theory level of explanation. We formulate a set of assumptions and constraints that may be used to analyze motion and compare the predicted percept to psychophysical data.
Trying to understand perception by studying only neurons is like trying to understand bird flight by studying only feathers: It just cannot be done. In order to understand bird flight, we have to understand aerodynamics; only then do the structure of feathers and the different shapes of birds' wings make sense. [Marr, 1982] p. 27
In order to understand how the human visual system resolves the ``integration versus segmentation dilemma'' we use the method of computational modeling. Marr and Poggio [Marr, 1982, Marr and Poggio, 1977] distinguished between three levels at which a complex information-processing device can be understood: the computational, the algorithmic, and the implementation levels. Table 1 reproduces Marr's summary of the three levels. The computational level is the most abstract; it describes the problem the system is trying to solve and the constraints it uses in order to solve it. The algorithmic level addresses questions of representation and of the algorithm used to satisfy the constraints and assumptions of the system. Finally, the implementation level deals with the details of the hardware in which the algorithm is embodied.
Here we focus on the most abstract level, that of the computational theory. Rather than describing a particular biological implementation of a particular algorithm, we attempt to find the constraints and assumptions used by the visual system when estimating motion. The term ``computational theory'' is meant to emphasize the difference between this approach and a ``verbal theory''. A computational theory of vision requires more than a list of assumptions and constraints; rather, these assumptions and constraints should be formulated in such a way that they can be computed for a given scene from the image data, and can therefore predict a percept for that scene.
In order to formalize our theory we employ a Bayesian inference framework [Knill and Richards, 1996]. In the Bayesian formalism, the assumptions are formulated as probabilities, and inference corresponds to finding the probability of hypotheses given observations. In the context of motion analysis, we express our assumptions in terms of prior probabilities (the probability of a motion hypothesis in the absence of any data) and likelihoods (the probability of the image data given a motion hypothesis). We use the machinery of Bayesian inference to calculate the posterior probability: the probability of a motion hypothesis given the image data.
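Concretely, writing $V$ for a candidate motion hypothesis and $I$ for the image data (a notation introduced here purely for illustration), Bayes' rule combines these three quantities:
\[
P(V \mid I) \;=\; \frac{P(I \mid V)\,P(V)}{P(I)} \;\propto\; P(I \mid V)\,P(V),
\]
where $P(V)$ is the prior, $P(I \mid V)$ is the likelihood, and $P(V \mid I)$ is the posterior. The denominator $P(I)$ does not depend on $V$, so it can be ignored when comparing motion hypotheses for a fixed scene.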
Finding the most probable motion hypothesis may require rather sophisticated computational algorithms. For example, in this thesis we use inversion of large matrices and an iterative algorithm known as Expectation-Maximization [Dempster et al., 1977]. We do not, however, claim that the brain uses these algorithms. We use these algorithms here as tools in order to test our computational theory; the prediction of the theory is obtained by finding the most probable motion hypothesis for a given scene. We then compare this predicted motion to the human percept. If we find that the predicted motion matches the human percept, this supports our computational theory but says nothing about the algorithm or implementation the human visual system uses in order to arrive at the same percept.
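To give a concrete sense of what the Expectation-Maximization algorithm does, the sketch below applies it to a toy problem: fitting a mixture of two one-dimensional Gaussians to synthetic data. The soft assignment of each observation to a component is loosely analogous to assigning image measurements to one of several candidate motions; this is a generic illustration of EM on made-up data, not the motion model or code used in the thesis.
\begin{verbatim}
import numpy as np

# Toy illustration of Expectation-Maximization [Dempster et al., 1977]:
# fit a mixture of two 1-D Gaussians. All names and data are illustrative.
rng = np.random.default_rng(0)

# Synthetic observations drawn from two latent causes
# (analogous to measurements arising from two motions).
data = np.concatenate([rng.normal(-2.0, 1.0, 200),
                       rng.normal(3.0, 1.0, 200)])

# Initial guesses for the component means, variances, and mixing weights.
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
weight = np.array([0.5, 0.5])

for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    lik = (np.exp(-(data[:, None] - mu) ** 2 / (2 * var))
           / np.sqrt(2 * np.pi * var))
    resp = weight * lik
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the soft assignments.
    nk = resp.sum(axis=0)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    var = (resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk
    weight = nk / len(data)

print("means:", mu)
print("variances:", var)
print("weights:", weight)
\end{verbatim}
Each iteration of this loop is guaranteed not to decrease the likelihood of the data, which is the property that makes EM attractive for the kind of mixture estimation problem that arises when a scene contains several motions.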