Python assignments for the class 'Computer Vision' (Prof. Pollefeys) @ ETH Zurich.
Each assignment was awarded the full grade of 100 points.
Credit for code skeletons, assignment setup and descriptions goes to Prof. Pollefeys and his teaching assistants (course website).
The first task is training a simple binary classifier on the 2D data visualized in the following figure:
In the figure, red points belong to cluster 0 and blue points to cluster 1. Both a linear classifier and a multi-layer perceptron (MLP) classifier were implemented in PyTorch. As expected, only the MLP achieved near-perfect classification (over 99% accuracy), since the two clusters are separated by a non-linear (roughly circular) boundary. Alternatively, switching to polar coordinates while keeping the linear classifier also gave good results (over 90% accuracy).
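For illustration, a minimal sketch of such an MLP binary classifier in PyTorch is shown below; the hidden layer size, training loop and stand-in data are assumptions, not the exact assignment code:

```python
import torch
import torch.nn as nn

class MLPClassifier(nn.Module):
    """Small MLP for 2D binary classification (hidden size is an assumed value)."""
    def __init__(self, hidden_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),   # single logit for binary classification
        )

    def forward(self, x):
        return self.net(x)

# Toy training loop on random stand-in data (the real assignment uses the provided 2D clusters).
model = MLPClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.BCEWithLogitsLoss()

x = torch.randn(256, 2)                        # placeholder points
y = (x.norm(dim=1, keepdim=True) > 1).float()  # circular decision boundary, as in the task

for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

accuracy = ((model(x) > 0).float() == y).float().mean()
```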
For this part, an MLP with one hidden layer as well as a convolutional neural network (CNN) were used to classify digits from the MNIST dataset:
While needing fewer parameters, the CNN achieved higher accuracy (over 98%) than the MLP, which illustrates the advantage of CNNs for this type of task. Finally, a confusion matrix was produced for both the MLP (left) and the CNN (right), showing the performance of each classifier; M<sub>i,j</sub> is the number of test samples for which the network predicted i but the ground-truth label was j:

A vectorized version of the mean-shift algorithm was implemented in order to segment an image of a cow, shown on the left; the ground truth is shown in the middle and the result on the right:
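A minimal sketch of one vectorized mean-shift iteration is shown below; the Gaussian kernel bandwidth and iteration count are assumed values, not the assignment's settings:

```python
import numpy as np

def meanshift_step(X, bandwidth=2.5):
    """One vectorized mean-shift iteration: every point moves to the weighted mean
    of all points, with Gaussian weights on the pairwise distances."""
    # pairwise squared distances, shape (N, N)
    dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    weights = np.exp(-0.5 * dist2 / bandwidth**2)          # Gaussian kernel
    # weighted average of all points, computed for every point at once
    return (weights @ X) / weights.sum(axis=1, keepdims=True)

def meanshift(X, n_iter=20, bandwidth=2.5):
    for _ in range(n_iter):
        X = meanshift_step(X, bandwidth)
    return X  # converged modes; clusters = groups of (nearly) identical rows
```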
In the second part, a simplified version of the SegNet CNN has been implemented, as depicted below. The resulting architecture was then trained and achieves a mean Intersection over Union (IoU) of over 85%:
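A minimal sketch of a SegNet-style encoder-decoder stage in PyTorch, reusing the max-pooling indices for unpooling as SegNet does; channel widths and depth are assumptions, not the exact simplified architecture used here:

```python
import torch
import torch.nn as nn

class MiniSegNet(nn.Module):
    """Single-stage SegNet-style network: the encoder stores pooling indices,
    the decoder reuses them for non-learned upsampling (channel sizes are assumed)."""
    def __init__(self, in_ch=3, num_classes=11):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1),
                                 nn.BatchNorm2d(64), nn.ReLU())
        self.pool = nn.MaxPool2d(2, 2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, 2)
        self.dec = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1),
                                 nn.BatchNorm2d(64), nn.ReLU(),
                                 nn.Conv2d(64, num_classes, 3, padding=1))

    def forward(self, x):
        x = self.enc(x)
        x, indices = self.pool(x)        # remember where the maxima were
        x = self.unpool(x, indices)      # place features back at those positions
        return self.dec(x)               # per-pixel class scores

logits = MiniSegNet()(torch.randn(1, 3, 64, 64))   # -> (1, 11, 64, 64)
```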
A camera has been calibrated based on correspondences between an image and points on a calibration target. Given the correspondences, the (simplified) calibration process consists of data normalization, estimating the projection matrix using the Direct Linear Transform (DLT), optimizing it based on the reprojection errors, and then decomposing it into the camera intrinsics (K matrix) and pose (rotation matrix R and translation vector T). The result, with the points reprojected using the calculated pose and intrinsics, is shown in the following figure:
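A minimal sketch of the DLT step and the reprojection errors, assuming already normalized correspondences (the function names are illustrative, not the assignment's API):

```python
import numpy as np

def estimate_projection_dlt(points2d, points3d):
    """Estimate the 3x4 projection matrix P from 2D-3D correspondences with the
    Direct Linear Transform; at least 6 correspondences are needed."""
    A = []
    for (u, v), X in zip(points2d, points3d):
        Xh = np.append(X, 1.0)                              # homogeneous 3D point
        A.append(np.concatenate([np.zeros(4), -Xh, v * Xh]))
        A.append(np.concatenate([Xh, np.zeros(4), -u * Xh]))
    # solution = right singular vector belonging to the smallest singular value
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)

def reprojection_errors(P, points2d, points3d):
    """Per-point distance between the observed 2D points and the reprojections."""
    Xh = np.hstack([points3d, np.ones((len(points3d), 1))])
    proj = (P @ Xh.T).T
    proj = proj[:, :2] / proj[:, 2:3]                       # dehomogenize
    return np.linalg.norm(proj - points2d, axis=1)
```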
From provided extracted keypoints and feature matches, the SfM pipeline combines relative pose estimation, absolute pose estimation and point triangulation. To initialize the scene, the relative pose between two images has been estimated; afterwards, the first points in the scene have been triangulated, and then iteratively more images were registered and new points triangulated. Using this, for the fountain shown in the left image, the resulting 3D point cloud, including the camera poses of the used views, can be obtained as in the right image:
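A minimal sketch of the linear triangulation step used when adding new points, assuming two known 3x4 projection matrices; this is a generic textbook version, not the assignment's exact code:

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.
    P1, P2: 3x4 projection matrices; x1, x2: corresponding 2D image points (u, v)."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                   # homogeneous solution with smallest singular value
    return X[:3] / X[3]          # inhomogeneous 3D point
```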
To demonstrate the working principle of RANSAC, it has been applied to the simple case of 2D line estimation on data with outliers; the results can be seen in the following:
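A minimal sketch of RANSAC for 2D line fitting; the iteration count and inlier threshold are assumed values:

```python
import numpy as np

def ransac_line(points, n_iters=200, threshold=0.1):
    """Fit a 2D line y = m*x + c with RANSAC and return the model with the most inliers.
    points: (N, 2) array of (x, y) samples containing outliers."""
    rng = np.random.default_rng(0)
    best_inliers, best_model = np.array([], dtype=int), None
    for _ in range(n_iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if np.isclose(x1, x2):                 # skip degenerate (vertical) sample
            continue
        m = (y2 - y1) / (x2 - x1)
        c = y1 - m * x1
        # distance of every point to the candidate line m*x - y + c = 0
        dist = np.abs(m * points[:, 0] - points[:, 1] + c) / np.sqrt(m**2 + 1)
        inliers = np.flatnonzero(dist < threshold)
        if len(inliers) > len(best_inliers):
            best_inliers, best_model = inliers, (m, c)
    # refine the winning model with least squares on all of its inliers
    if best_model is not None and len(best_inliers) >= 2:
        m, c = np.polyfit(points[best_inliers, 0], points[best_inliers, 1], 1)
        best_model = (m, c)
    return best_model, best_inliers
```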
This describes the task of reconstructing the dense geometry of an observed scene. In this assignment, the multi-view stereo problem is solved using deep learning. The pipeline of the method is shown in the following figure:
First, features are extracted for several images. Second, given uniformly distributed depth samples, the source features are warped into the reference view. Third, the matching similarity between the reference view and each source view is computed. Fourth, the matching similarities from all source views are integrated. Fifth, the integrated matching similarity is regularized to produce a probability volume. Finally, the depth map is obtained via depth regression. On a CPU, the model needs several hours for training. This has been applied to several scenes of the DTU MVS dataset. For example, the ground truth (left) and estimated depth map (right) are shown for a scan of a 'rabbit':
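A minimal sketch of the final depth regression step (a soft-argmin style expectation over the probability volume); the depth range in the usage example is an assumed DTU-like value:

```python
import torch
import torch.nn.functional as F

def depth_regression(prob_volume, depth_values):
    """Depth regression: the depth map is the expectation of the depth hypotheses
    under the per-pixel probability distribution.
    prob_volume: (B, D, H, W) probabilities over D depth samples (softmax output);
    depth_values: (D,) uniformly sampled depth hypotheses."""
    # weight every depth hypothesis by its probability and sum over the D dimension
    return torch.sum(prob_volume * depth_values.view(1, -1, 1, 1), dim=1)  # (B, H, W)

# usage with random placeholder data
scores = torch.randn(1, 64, 32, 40)           # unnormalized matching scores
prob = F.softmax(scores, dim=1)               # regularized probability volume
depths = torch.linspace(425.0, 935.0, 64)     # assumed depth range, DTU-like
depth_map = depth_regression(prob, depths)    # -> (1, 32, 40)
```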
One can also observe the L1 loss between the estimated depth and the ground truth converging over the 4 epochs of training in the following:
Some of the resulting 3D colored point clouds using the trained model are shown in the following:
The classifier should decide whether an image contains the back of a car (sample image on the left) or not (right):
The approach of this classifier is the following: after detecting features on an image of the training dataset, they are described using the histogram of oriented gradients (HoG) descriptor, and all descriptors from all images are then clustered into a visual vocabulary using k-means clustering. Each image is described by a bag-of-words (BoW) histogram, and a test image is then classified according to its nearest neighbor in the BoW histogram space. This rather basic approach has been shown to achieve accuracies of 95% and higher on both the positive and negative samples.
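A minimal sketch of the BoW histogram and nearest-neighbor classification steps, assuming the HoG descriptors and the k-means vocabulary (cluster centers) have already been computed; the function names are illustrative:

```python
import numpy as np
from scipy.spatial.distance import cdist

def bow_histogram(descriptors, vocabulary):
    """Build a bag-of-words histogram for one image: assign each HoG descriptor
    to its nearest visual word and count the assignments.
    descriptors: (N, d) descriptors of one image; vocabulary: (K, d) cluster centers."""
    words = np.argmin(cdist(descriptors, vocabulary), axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-8)        # normalized histogram

def classify_nearest_neighbor(test_hist, train_hists, train_labels):
    """Label a test image with the label of the closest training BoW histogram."""
    distances = np.linalg.norm(train_hists - test_hist, axis=1)
    return train_labels[np.argmin(distances)]
```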
A simplified version of the VGG image classification network has been implemented for the CIFAR-10 dataset. The goal is to correctly classify images into one of 10 different classes. After training for 80 epochs, accuracies of around 85% can be achieved even with this simplified version:
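A minimal sketch of such a VGG-style network for 32x32 CIFAR-10 images; the channel widths and block counts are assumptions, not the exact simplified configuration:

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    """One VGG-style block: a few 3x3 conv + ReLU layers followed by 2x2 max-pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class SimplifiedVGG(nn.Module):
    """VGG-like classifier for CIFAR-10 (assumed layer configuration)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            vgg_block(3, 64, 2),      # 32x32 -> 16x16
            vgg_block(64, 128, 2),    # 16x16 -> 8x8
            vgg_block(128, 256, 2),   # 8x8   -> 4x4
        )
        self.classifier = nn.Linear(256 * 4 * 4, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

logits = SimplifiedVGG()(torch.randn(8, 3, 32, 32))   # -> (8, 10)
```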
The CONDENSATION algorithm - CONditional DENSity propagATION over time - is a sample-based implementation of a recursive Bayesian filter, applied here to a tracking problem. The tracker uses a color histogram to follow a template in the image over several frames. For each new frame, it evaluates samples in a cloud of points around the previous position, weighs them according to their similarity with the template, takes the weighted average position, and updates the color histogram as a convex combination of the previous and current template color histograms (controlled by a parameter α). The tracking performance can be tuned using several parameters and then applied to the simple example of following a hand in open space (left), the more advanced examples of following a hand through an occlusion and in front of a noisy background (middle), as well as following a ball bouncing off a wall (right):
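A minimal sketch of one CONDENSATION-style iteration (resample, propagate, measure, estimate); the pure-diffusion propagation model, the noise level and the `measure_fn` interface are assumptions for this sketch, not the assignment's exact implementation:

```python
import numpy as np

def condensation_step(samples, weights, measure_fn, sigma=5.0):
    """One iteration of a simple CONDENSATION-style tracker.
    samples: (N, 2) particle positions; weights: (N,) normalized weights;
    measure_fn: maps a position to a similarity score with the template
    (e.g. a color-histogram comparison)."""
    n = len(samples)
    # 1. resample particles proportionally to their weights
    idx = np.random.choice(n, size=n, p=weights)
    samples = samples[idx]
    # 2. propagate: diffuse the cloud with Gaussian noise (no dynamic model here)
    samples = samples + np.random.normal(0.0, sigma, samples.shape)
    # 3. measure: reweight by similarity with the template
    weights = np.array([measure_fn(s) for s in samples])
    weights = weights / weights.sum()
    # 4. estimate: weighted mean position of the cloud
    estimate = (weights[:, None] * samples).sum(axis=0)
    return samples, weights, estimate
```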
For the second example, following a hand through an occlusion and in front of a noisy background, an animated version of the tracking can be seen in the following: