MVC-Net: A Convolutional Neural Network Architecture for Manifold-Valued Images With Applications
2020·arXiv
Abstract

Geometric deep learning has attracted significant attention in recent years, in part due to the availability of exotic data types for which traditional neural network architectures are not well suited. Our goal in this paper is to generalize convolutional neural networks (CNNs) to the manifold-valued image case, which arises commonly in medical imaging and computer vision applications. Explicitly, the input data to the network is an image where each pixel value is a sample from a Riemannian manifold. To achieve this goal, we must generalize the basic building block of traditional CNN architectures, namely, the weighted combination operation. To this end, we develop a tangent space combination operation which is used to define a convolution operation on manifold-valued images that we call the Manifold-Valued Convolution (MVC). We prove theoretical properties of the MVC operation, including equivariance to the action of the isometry group admitted by the manifold, and characterize when compositions of MVC layers collapse to a single layer. We present a detailed description of how to use MVC layers to build full, multi-layer neural networks that operate on manifold-valued images, which we call the MVC-Net. Further, we empirically demonstrate superior performance of MVC-Nets in medical imaging and computer vision tasks.

In computer vision, convolutional neural networks (CNNs) and their variants are ubiquitous and serve as powerful tools for various tasks, e.g. image classification and segmentation. However, traditional CNNs are restricted to data residing in vector spaces, while data residing in smooth non-Euclidean spaces, e.g. Riemannian manifolds, arise naturally in many problem domains. Although Riemannian manifolds lack the vector space structure, the associated Riemannian metric induces the notions of distance and angle (between intersecting curves on the manifold) intrinsic to the manifold. Commonly encountered examples of Riemannian manifolds in computer vision are the manifold of (n × n) symmetric positive-definite (SPD) matrices P_n, the special orthogonal group SO(n), the Grassmann manifold Gr(n, p), and the n-sphere S^n. Recently, there has been a growing interest in generalizing the well-known CNN and its variants to cope with these types of data while respecting the underlying geometry.

In the past few years, there has been a surge in research to develop deep neural networks that deal with data residing on the aforementioned Riemannian manifolds. At the outset, it will be useful to categorize two types of problems concerning data in non-Euclidean spaces. These two types are: (i) data that are samples of functions defined on smooth manifolds and (ii) data that are samples of manifold-valued functions whose domain is Euclidean or data that are simply sample points on manifolds. In this paper we will address the problem of developing deep neural networks for the data defined in (ii).

In the context of data defined in (i) above, in the recent past, there has been a flurry of research activity in developing the analogs of CNNs. For example, Masci et al. (2015) presented the geodesic convolutional neural network (GCNN) for which they defined the geodesic convolution as standard convolution on the local geodesic charts. Poulenard & Ovsjanikov (2018) presented convolution for directional functions which reduces to the usual convolution when the underlying manifold is  Rd. In both Masci et al. (2015) and Poulenard & Ovsjanikov (2018), convolutions are performed in local geodesic polar charts constructed on the manifold. Moving on, samples of functions defined on a sphere are encountered in numerous applications of computer vision and to this end, there is the spherical CNN work reported in Esteves et al. (2018), Kondor et al. (2018), and Cohen et al. (2018). In this problem, group equivariant convolutions were used to replace the standard convolutions in CNNs. Note that the group action on the sphere corresponds to rotations in 3D which are members of the group SO(3). Recently, the equivariance of convolutions to more general classes of group actions suited for other Riemannian homogeneous spaces has been reported in Kondor & Trivedi (2018), Banerjee et al. (2019), and Cohen et al. (2018). We will not discuss methods suited for this type of data any further in this paper but refer the reader to Bronstein et al. (2017) who present a good survey of state-of-the-art in geometric deep learning.

In the context of data described in (ii) above, Huang & Van Gool (2017) proposed a network architecture that consisted of layers which explicitly utilize the structure of SPD matrices. Huang et al. (2018) presented a deep network for classification of hand-crafted features residing on a Grassmann manifold. However, the above architectures do not resemble the classic convolutional layer in the traditional CNN, which is viewed as one of the key components in the success of CNNs. Furthermore, the operations used in the above networks are not valid for general Riemannian manifolds. For example, in Huang & Van Gool (2017), applying ReLU and logarithms on the eigenvalues is not valid for Grassmann manifolds. Besides convolutional layers, batch normalization is also a useful trick in CNNs to avoid over-fitting, and Brooks et al. (2019) proposed a batch normalization technique for manifold-valued networks. In this paper, we focus our attention on data represented on a grid where each of the grid points is associated with a value on a known manifold, e.g. f : Z^2 → M. However, all the aforementioned works are targeted at specific manifolds, e.g. the Grassmann or the SPD manifolds. The lack of a consistent framework for designing deep network architectures for data residing on a general Riemannian manifold is partly due to the fact that there is no natural analog of the convolution operation for manifold-valued data. This justifies the need to generalize the convolution operation for data in Riemannian manifolds in order to develop a consistent framework for deep learning to tackle such data. Recently, Chakraborty et al. (2019) proposed to use the weighted Fréchet mean (wFM) (Maurice Fréchet, 1948) as an analog to the classical (Euclidean space) convolution operation for data points residing in Riemannian manifolds. Note that although their definition of wFM as an analogous operation is valid for any Riemannian manifold, the convexity constraints in the definition used for wFM put certain restrictions on the range of values that the wFM can take on, and this can limit the modeling capacity of the network as we will see later.

In order to generalize the (discrete) convolution operation in Euclidean spaces (which is simply a linear combination of weights and image function values inside a certain window) to Riemannian manifolds, we have to define a meaningful “equivalent” of the aforementioned linear combination operation in the Riemannian manifold setting. In this paper, we propose to map the manifold-valued data points within a convolution window defined over the manifold-valued image to the tangent space anchored at the FM of these points using the Riemannian Log map. We then perform the linear combination operation in the tangent space (which is isomorphic to the Euclidean space) and map the result back to the manifold using the Riemannian Exp map. We provide the details of this operation, called the manifold-valued convolution (MVC), in the next section. Further, we prove that the proposed MVC is equivariant to the isometry group actions admitted by the manifold. Armed with MVC, we then describe how to build an MVC-Net for manifold-valued data by defining the corresponding activation functions and fully-connected (FC) layers for the manifold-valued data. Thus, the main contributions of our work in this paper are the following. (i) We define the MVC operation for general Riemannian manifolds and prove that MVC is equivariant to isometry group actions admitted by the manifold. (ii) We present a deep neural network architecture based on MVC, called MVC-Net, for any Riemannian manifold. (iii) Further, we present experiments demonstrating performance of the MVC-Net on classification problems encountered in medical image analysis and computer vision along with comparisons to the state-of-the-art.

The rest of this paper is organized as follows. In section 2, we review some essential background in Riemannian geometry. In section 3, we propose a novel generalization, the MVC, of the convolution operation for Riemannian manifold-valued images and show that MVC is equivariant to isometry group actions admitted by the manifold. Then, we propose a deep neural network architecture based on MVC, called the MVC-Net. In section 4, we present the experimental results and finally draw conclusions in section 5.

In this section, we review some basic material from Riemannian geometry that is necessary in our work.

Let (M, g) be a d-dimensional Riemannian manifold. For p ∈ M, the tangent space of M at p is denoted T_pM, which is a d-dimensional vector space. Equipped with the Levi-Civita connection, the geodesic starting at p is denoted γ_v : I → M with γ_v(0) = p, where I is some interval containing 0 and v is the initial tangent vector, i.e. γ′_v(0) = v. Sometimes a geodesic is specified by its two endpoints p, q, and in this case we denote the geodesic by γ_{p,q} such that γ_{p,q}(0) = p and γ_{p,q}(1) = q. The exponential map Exp_p : D(p) ⊂ T_pM → M is defined by Exp_p(v) = γ_v(1), where D(p) = {v ∈ T_pM : γ_v(1) is defined}. On a domain where the exponential map is a diffeomorphism onto its range, its inverse is denoted Log_p = Exp_p^{-1}. These two maps will be of fundamental importance for our proposed layer, which is discussed later.

In general, there is no global coordinate system on a Riemannian manifold. Therefore, a local coordinate system is important for doing computations on Riemannian manifolds. The most common one is the normal coordinate system, which is based on the Riemannian exponential map and log map. The normal coordinates are constructed as follows. For p ∈ M, there exists a neighborhood U_p ⊂ M of p and a neighborhood V ⊂ T_pM such that Exp_p is a diffeomorphism between V and U_p (Lemma 5.10 in Lee (2006)). The neighborhood U_p is called the normal neighborhood. The normal coordinate of q ∈ U_p with respect to the normal neighborhood U_p is given by Log_p(q). This concept is important as we will use it in the definition of manifold-valued convolution in section 3.

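The Riemannian metric g induces a distance given by

$$d_g(p, q) = \inf_{\gamma} \int_0^1 \sqrt{g_{\gamma(t)}\big(\gamma'(t), \gamma'(t)\big)}\, dt,$$

where the infimum is taken over all piecewise smooth curves γ : [0, 1] → M with γ(0) = p and γ(1) = q.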

Let x_1, . . . , x_n ∈ M. The Fréchet mean (FM) of x_1, . . . , x_n is

$$\mathrm{FM}\big(\{x_i\}_{i=1}^n\big) = \arg\min_{m \in M} \sum_{i=1}^n d_g^2(x_i, m).$$

This is a generalization of the mean of points in a vector space. The existence and uniqueness of the FM are discussed in Afsari (2011). To be precise, the FM is unique if x_1, . . . , x_n lie in an open ball of radius r_cvx, where r_cvx is the convexity radius of M (Groisser, 2004). In practice, this condition is always assumed to hold.
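To make the definition concrete, the following is a minimal sketch of one common way to compute the FM numerically: repeatedly average the Log-mapped points in the tangent space at the current estimate and map the average back with Exp. It assumes M is the unit sphere S^2 (chosen only because its Exp and Log maps have simple closed forms) and that the points lie in a small geodesic ball so the FM is unique. The function names and the iteration scheme are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sphere_exp(p, v):
    """Exp map on the unit sphere: follow the geodesic from p with initial velocity v."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return p
    return np.cos(norm) * p + np.sin(norm) * v / norm

def sphere_log(p, q):
    """Log map on the unit sphere (not defined when q is antipodal to p)."""
    c = np.dot(p, q)
    u = q - c * p                      # component of q orthogonal to p
    s = np.linalg.norm(u)
    if s < 1e-12:
        return np.zeros_like(p)
    return np.arctan2(s, c) * u / s    # geodesic distance times unit direction

def frechet_mean(points, iters=100, tol=1e-12):
    """Fixed-point iteration: average in the tangent space, then map back."""
    m = points[0]
    for _ in range(iters):
        step = np.mean([sphere_log(m, x) for x in points], axis=0)
        if np.linalg.norm(step) < tol:
            break
        m = sphere_exp(m, step)
    return m

# Hypothetical data: 20 points sampled in a small ball around the north pole.
rng = np.random.default_rng(0)
north = np.array([0.0, 0.0, 1.0])
tangents = [np.append(0.2 * rng.standard_normal(2), 0.0) for _ in range(20)]
points = [sphere_exp(north, v) for v in tangents]

x_bar = frechet_mean(points)
# At the FM, the Log-mapped points average to (numerically) zero.
residual = np.linalg.norm(np.mean([sphere_log(x_bar, x) for x in points], axis=0))
print(x_bar, residual)
```

The vanishing residual is the first-order optimality condition of the sum-of-squared-distances objective, which is why it serves as the stopping criterion here.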

With this intrinsic distance metric, the Riemannian manifold (M, d_g) is a metric space and a natural transformation to consider is the isometry. For a Riemannian manifold, a transformation φ : (M, g) → (M̃, g̃) is called an isometry if it is a diffeomorphism and g = φ*g̃, where φ* is the pullback operation of φ. In this work, we consider isometries from M to M. It is known that the collection of isometries forms a group under composition, denoted I(M). For a smooth map f : M → M, a desired property would be isometry equivariance, i.e. φ ◦ f = f ◦ φ. Another similar concept is isometry invariance, i.e. f ◦ φ = f.

Remark. Note that with a slight abuse of notation, for a metric space X, we denote the set of all isometry transformations of X by I(X) as well.

In this section, we present the MVC and show that it is equivariant under isometry group actions admitted by the manifold. Then we present the architecture of MVC-Net by introducing the basic constituent layers of the MVC-Net.

3.1. Manifold-valued convolution (MVC)

Recall that in a CNN the convolution operation involves a linear combination of the data in the window, i.e. ∑_{i=1}^n w_i x_i. Due to the lack of vector space structure on Riemannian manifolds, we cannot perform this usual convolution on manifold-valued images directly. In this work, we propose a generalization of the above described standard convolution to manifold-valued images, called the manifold-valued convolution (MVC), defined as follows.

Definition 1. Let (M, g) be a Riemannian manifold and let f : Z^n → M and w : Z^n → R be two functions defined on Z^n, where Z is the set of all integers. The convolution f ∗ w : Z^n → M is defined by

image

for y ∈ Z^n, where m = FM_{z∈Z^n}(f(z)).

An illustration of the MVC operation can be seen in Figure 1. An important property of the convolution operation in Euclidean spaces is that it is equivariant to translation, which is the natural isometry group action for Euclidean spaces. Thus, MVC, as a generalization of the convolution to Riemannian manifold-valued images, is expected to possess the analogous property, i.e. equivariance to isometry group actions admitted by the manifold. The following lemma is useful for proving this result.

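Lemma 1. Let φ : M → M be an isometry. Then for p ∈ M and v ∈ D(p) ⊂ T_pM,

$$\varphi\big(\mathrm{Exp}_p(v)\big) = \mathrm{Exp}_{\varphi(p)}\big(d\varphi_p(v)\big),$$

where dφ_p is the differential of φ at p. Therefore when the inverse of Exp_p exists,

$$\mathrm{Log}_{\varphi(p)}\big(\varphi(q)\big) = d\varphi_p\big(\mathrm{Log}_p(q)\big).$$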

The proof of this lemma can be found in most of the introductory textbooks in Riemannian geometry, e.g. proposition 5.9 in Lee (2006).
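As a numerical illustration of Lemma 1, the sketch below checks both identities on a simple case. It assumes M is the unit sphere S^2 and φ is a rotation R ∈ SO(3), whose differential acts on tangent vectors by the same matrix R. The helper names are hypothetical and the check is an illustration, not a proof.

```python
import numpy as np

def sphere_exp(p, v):
    """Exp map on the unit sphere."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return p
    return np.cos(norm) * p + np.sin(norm) * v / norm

def sphere_log(p, q):
    """Log map on the unit sphere (not defined when q is antipodal to p)."""
    c = np.dot(p, q)
    u = q - c * p
    s = np.linalg.norm(u)
    if s < 1e-12:
        return np.zeros_like(p)
    return np.arctan2(s, c) * u / s

def random_rotation(rng):
    """Random element of SO(3): orthogonalize a Gaussian matrix, then fix the determinant."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def unit(x):
    """Project a nonzero vector onto the unit sphere."""
    return x / np.linalg.norm(x)

rng = np.random.default_rng(1)
p = unit(rng.standard_normal(3))
q = unit(rng.standard_normal(3))
R = random_rotation(rng)          # the isometry phi(x) = R x
v = sphere_log(p, q)              # a tangent vector at p

# phi(Exp_p(v)) versus Exp_{phi(p)}(d phi_p(v))
exp_gap = np.linalg.norm(R @ sphere_exp(p, v) - sphere_exp(R @ p, R @ v))
# Log_{phi(p)}(phi(q)) versus d phi_p(Log_p(q))
log_gap = np.linalg.norm(sphere_log(R @ p, R @ q) - R @ sphere_log(p, q))
print(exp_gap, log_gap)
```

Both gaps should be at the level of floating-point round-off.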

Theorem 1. The MVC is equivariant to isometry group actions both in the domains and the ranges of f and w i.e. for  f : Zn → M and w : Zn → R,

image

image

Figure 1. Tangent combination operation.

Proof. We show only (2) here since the other three equalities follow from a similar derivation. First, note that the FM of φ ◦ f is φ(FM_{z∈Z^n}(f(z))) := φ(m), where m = FM_{z∈Z^n}(f(z)). This is a consequence of the invariance of the intrinsic distance metric under isometries. Then for y ∈ Z^n

image

This concludes the proof. ■

Note that the equivariance is preserved even if the FM m is replaced by any other point as long as the choice of the point is also equivariant, e.g. replace m by f(z_0) for some z_0 ∈ Z^n. This avoids the computation of the FM and hence is computationally more efficient. In practice, the analytic forms of f and w are unknown and only x_i = f(z_i) and w_i = w(z_i), i = 1, . . . , N, are observed for some fixed z_1, . . . , z_N ∈ Z^n. Thus from now on, we consider {x_i}_{i=1}^N and {w_i}_{i=1}^N instead of f and w. For this situation, the MVC can be simplified as

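$$\mathrm{MVC}\big(\{x_i\}_{i=1}^N, \{w_i\}_{i=1}^N\big) = \mathrm{Exp}_{\bar{x}}\Big(\sum_{i=1}^N w_i\, \mathrm{Log}_{\bar{x}}(x_i)\Big),$$

where x̄ = FM({x_i}_{i=1}^N). For applications in computer vision and medical imaging, the domain Z^n is usually Z^2 or Z^3.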
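The windowed MVC described above (Log-map the points to the tangent space at their FM, take the weighted sum, Exp-map back) can be sketched as follows. This is a minimal illustration assuming M is the unit sphere S^2; the function names, the FM iteration, and the random data are assumptions for the example and are not the authors' implementation. The last lines check isometry equivariance numerically for a random rotation.

```python
import numpy as np

def sphere_exp(p, v):
    """Exp map on the unit sphere."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return p
    return np.cos(norm) * p + np.sin(norm) * v / norm

def sphere_log(p, q):
    """Log map on the unit sphere (not defined when q is antipodal to p)."""
    c = np.dot(p, q)
    u = q - c * p
    s = np.linalg.norm(u)
    if s < 1e-12:
        return np.zeros_like(p)
    return np.arctan2(s, c) * u / s

def frechet_mean(points, iters=100, tol=1e-12):
    """Tangent-space averaging iteration for the FM."""
    m = points[0]
    for _ in range(iters):
        step = np.mean([sphere_log(m, x) for x in points], axis=0)
        if np.linalg.norm(step) < tol:
            break
        m = sphere_exp(m, step)
    return m

def mvc(points, weights):
    """Exp at the FM of the weighted sum of the Log-mapped points in one window."""
    anchor = frechet_mean(points)
    tangent_sum = sum(w * sphere_log(anchor, x) for w, x in zip(weights, points))
    return sphere_exp(anchor, tangent_sum)

rng = np.random.default_rng(0)
north = np.array([0.0, 0.0, 1.0])
# Hypothetical 3x3 window (N = 9) of sphere-valued pixels near the north pole.
window = [sphere_exp(north, np.append(0.2 * rng.standard_normal(2), 0.0)) for _ in range(9)]
weights = rng.standard_normal(9)       # unconstrained real weights

out = mvc(window, weights)

# Equivariance check with a random rotation R (an isometry of the sphere).
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R = Q if np.linalg.det(Q) > 0 else -Q
out_rotated_input = mvc([R @ x for x in window], weights)
equivariance_gap = np.linalg.norm(R @ out - out_rotated_input)
print(out, equivariance_gap)
```

Note that the weights here are arbitrary real numbers; no convexity constraint is imposed on them.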

3.2. Activation Functions for MVC-Net

In classical neural networks, the activation functions, e.g. ReLU, sigmoid, tanh, etc., play an important role as they make the resulting network non-linear and thus we are able to build a deep neural network by stacking layers of different sizes along with the activation functions. The choice of activation functions has been studied extensively and there are a few guidelines for choosing one. First, the activation function must be a contraction map (Mallat, 2016). The precise definition of a contraction map will be given later. Second, the activation function should prevent multiple stacked layers of the network from collapsing to a single layer, which allows us to build a deep network. In this section, we will analyze the MVC-net in the context of the above guidelines. We first show that the MVC layer is not a contraction and under some conditions cascaded MVC layers will collapse into one. Then we give a possible choice of an activation function for use in the MVC-net.

Contraction Property

The following definition of contraction is from Mallat (2016).

Definition 2. Let F : U → V where U and V are metric spaces with distance metrics d_U and d_V, respectively. The mapping F is called a contraction if there exists c < 1 such that d_V(F(x), F(y)) ≤ c d_U(x, y) for all x, y ∈ U. If d_V(F(x), F(y)) ≤ d_U(x, y) for all x, y ∈ U, then F is called a non-expansion.

Since the range of MVC is a normal neighborhood of the anchor point, it can be easily shown that the MVC layer is not a contraction by considering large  wi’s.

Collapsibility Property

In classical neural networks, one reason for adding non-linear activation functions between layers, e.g. sigmoid, ReLU, tanh, is that without these, the multi-layer network collapses into a single-layer network. We want to know if a similar behavior is exhibited by the MVC-net. For example, consider a network with two MVC layers (without non-linear activation in between). For the sake of simplicity, suppose that there are only two MVC “filters” in the first MVC layer and one MVC “filter” in the second MVC layer, i.e. the first MVC layer takes {x_i}_{i=1}^{2N} as input with weights {w_i}_{i=1}^{2N} and the second MVC layer takes {M_1, M_2} as input with weights {h_1, h_2}, where M_1 = MVC({x_i}_{i=1}^N, {w_i}_{i=1}^N) and M_2 = MVC({x_i}_{i=N+1}^{2N}, {w_i}_{i=N+1}^{2N}). Is this two-layer MVC-net equivalent to a one-layer MVC-net, i.e. does there exist {w̃_i}_{i=1}^{2N} such that MVC({M_1, M_2}, {h_1, h_2}) = MVC({x_i}_{i=1}^{2N}, {w̃_i}_{i=1}^{2N})? We answer this question in the affirmative under some conditions, as stated in the following theorem.

Theorem 2. Let {x_i}_{i=1}^{2N} ⊂ M. If {x_i}_{i=1}^N and {x_i}_{i=N+1}^{2N} belong to the same normal coordinate chart U, then two cascaded MVC layers will collapse to a single layer.

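Proof. As mentioned earlier, the anchor point of map (1) can be any point in the normal coordinate chart. Let p ∈ U ⊂ M be such a point. Consider the weights {w_i}_{i=1}^{2N} for {x_i}_{i=1}^{2N}. Apply the map (1) first to {x_i}_{i=1}^N and {x_i}_{i=N+1}^{2N} separately and obtain M_1 = MVC({x_i}_{i=1}^N, {w_i}_{i=1}^N) and M_2 = MVC({x_i}_{i=N+1}^{2N}, {w_i}_{i=N+1}^{2N}). Then apply the map (1) to M_1 and M_2 again and obtain M = MVC({M_1, M_2}, {h_1, h_2}). We will show that there exists {w̃_i}_{i=1}^{2N} such that M = MVC({x_i}_{i=1}^{2N}, {w̃_i}_{i=1}^{2N}). Observe that, since Log_p ◦ Exp_p is the identity on the normal coordinate chart,

$$\begin{aligned}
M &= \mathrm{Exp}_p\big(h_1\, \mathrm{Log}_p(M_1) + h_2\, \mathrm{Log}_p(M_2)\big) \\
  &= \mathrm{Exp}_p\Big(h_1 \sum_{i=1}^{N} w_i\, \mathrm{Log}_p(x_i) + h_2 \sum_{i=N+1}^{2N} w_i\, \mathrm{Log}_p(x_i)\Big) \\
  &= \mathrm{Exp}_p\Big(\sum_{i=1}^{2N} \tilde{w}_i\, \mathrm{Log}_p(x_i)\Big).
\end{aligned}$$

Hence, w̃_i = h_1 w_i for i = 1, . . . , N and w̃_i = h_2 w_i for i = N + 1, . . . , 2N, and the two layers collapse into a single layer. ■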
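The collapse in Theorem 2 can be checked numerically. The sketch below assumes M is the unit sphere S^2, uses a single shared anchor p for both layers, and keeps every tangent combination shorter than π so that Log_p inverts Exp_p. All names and the chosen weights are illustrative assumptions.

```python
import numpy as np

def sphere_exp(p, v):
    """Exp map on the unit sphere."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return p
    return np.cos(norm) * p + np.sin(norm) * v / norm

def sphere_log(p, q):
    """Log map on the unit sphere (not defined when q is antipodal to p)."""
    c = np.dot(p, q)
    u = q - c * p
    s = np.linalg.norm(u)
    if s < 1e-12:
        return np.zeros_like(p)
    return np.arctan2(s, c) * u / s

def mvc_at(anchor, points, weights):
    """MVC with a fixed anchor point instead of the FM."""
    tangent_sum = sum(w * sphere_log(anchor, x) for w, x in zip(weights, points))
    return sphere_exp(anchor, tangent_sum)

rng = np.random.default_rng(2)
p = np.array([0.0, 0.0, 1.0])           # shared anchor for both layers
N = 4
xs = [sphere_exp(p, np.append(0.2 * rng.standard_normal(2), 0.0)) for _ in range(2 * N)]
w = rng.uniform(-0.5, 0.5, size=2 * N)  # small weights keep outputs inside the chart
h1, h2 = 0.7, -1.3

# Two cascaded layers without an activation in between.
M1 = mvc_at(p, xs[:N], w[:N])
M2 = mvc_at(p, xs[N:], w[N:])
two_layer = mvc_at(p, [M1, M2], [h1, h2])

# Single layer with the collapsed weights from the proof.
w_tilde = np.concatenate([h1 * w[:N], h2 * w[N:]])
one_layer = mvc_at(p, xs, w_tilde)

gap = np.linalg.norm(two_layer - one_layer)
print(gap)
```

With a common anchor the two outputs agree up to round-off, which is the motivation for placing a non-linear activation between MVC layers.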

If we consider different normal charts for {x_i}_{i=1}^N and {x_i}_{i=N+1}^{2N}, i.e. U_{x̄_1} for {x_i}_{i=1}^N and U_{x̄_2} for {x_i}_{i=N+1}^{2N}, then the cascaded two-layer structure will not collapse. However, to avoid any possibility of a collapse, e.g. in the case that d(x̄_1, x̄_2) ≈ 0, we recommend the inclusion of a non-linear activation function between the layers. The choice of activation functions for manifold-valued input is, however, limited. As the most widely used activation function in CNNs is ReLU, we propose to use the tangent ReLU (tReLU) (Chakraborty et al., 2019) as the activation function for the MVC-Net.

3.3. Manifold-valued Fully-connected (MVFC) Layers for MVC-Net

The outputs of the last MVC layer/tReLU layer would be a set of points on the manifold M. Therefore the desired FC layer should take points on the manifold as inputs and output labels (hard assignment) or probability vectors (soft assignment). In this work, we adopt the FC layer used in Chakraborty et al. (2019), i.e. for {x_1, . . . , x_n} ⊂ M, first transform {x_1, . . . , x_n} to {d_g(x_1, x̄), . . . , d_g(x_n, x̄)}, where x̄ is the FM of the x_i, and then apply the usual (Euclidean) FC layers as in a CNN.
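A minimal sketch of this distance-to-FM transform followed by one Euclidean FC layer is given below. It assumes M is the unit sphere S^2, and the layer sizes, random weights, and function names are hypothetical choices for illustration rather than the configuration used in the experiments.

```python
import numpy as np

def sphere_exp(p, v):
    """Exp map on the unit sphere."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return p
    return np.cos(norm) * p + np.sin(norm) * v / norm

def sphere_log(p, q):
    """Log map on the unit sphere (not defined when q is antipodal to p)."""
    c = np.dot(p, q)
    u = q - c * p
    s = np.linalg.norm(u)
    if s < 1e-12:
        return np.zeros_like(p)
    return np.arctan2(s, c) * u / s

def frechet_mean(points, iters=100, tol=1e-12):
    """Tangent-space averaging iteration for the FM."""
    m = points[0]
    for _ in range(iters):
        step = np.mean([sphere_log(m, x) for x in points], axis=0)
        if np.linalg.norm(step) < tol:
            break
        m = sphere_exp(m, step)
    return m

def mvfc_features(points):
    """Map manifold points to the vector of geodesic distances to their FM."""
    x_bar = frechet_mean(points)
    return np.array([np.linalg.norm(sphere_log(x_bar, x)) for x in points])

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(4)
north = np.array([0.0, 0.0, 1.0])
points = [sphere_exp(north, np.append(0.2 * rng.standard_normal(2), 0.0)) for _ in range(6)]

feats = mvfc_features(points)               # Euclidean vector in R^6
W = 0.1 * rng.standard_normal((2, 6))       # hypothetical FC weights for 2 classes
b = np.zeros(2)
probs = softmax(W @ feats + b)              # class probabilities (soft assignment)
print(feats, probs)
```

Because isometries preserve geodesic distances and map the FM to the FM, the feature vector is unchanged when the same isometry is applied to all input points.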

3.4. Architecture of MVC-Net

For classification problems, the architecture we use in this work is parallel to CNN, i.e.

image

The number and the size of the layers will be presented in section 4 as it depends on the experiment settings. Besides the classical CNN, different deep network architectures for data in Euclidean space have been proposed to solve specific application problems and the convolutional layer serves as the basic component in most of them. In a similar manner, for manifold-valued data, based on the application problem, we envision an appropriate architecture with MVC layers as the building blocks.

In this section we present several experiments demonstrating the performance of the MVC-net. The experiments involve the use of data from medical imaging as well as computer vision domains. In all the experiments, we present comparisons to the state-of-the-art.

4.1. Parkinson’s Disease Classification

In this section, we apply the MVC-Net to a classification problem in the field of movement disorders, specifically, using diffusion magnetic resonance images (dMRIs) to classify Parkinson’s disease (PD) patients from controls.

Diffusion MRI Data Acquisition and Pre-processing

The dataset we use in this work consists of dMRIs acquired from 355 Parkinson’s disease (PD) patients and 356 control (healthy) subjects. This data was acquired from a combination of three sources, namely (i) the University of Florida (UFL), (ii) the Parkinson’s Progression Markers Initiative (PPMI) database (www.ppmi-info.org/data), and (iii) the University of Michigan. The data acquired at UFL is publicly available for research use by request via the National Institute of Neurological Disorders and Stroke (NINDS) Parkinson’s Disease Biomarker Program (PDBP). This PDBP data contained images that were collected using a 3.0 T MR scanner and a 32-channel quadrature volume head coil. The scanning parameters of the dMRI acquisition sequence were as follows: gradient directions = 64, b-values = 0/1000 s/mm², resolution = 2 mm uniform voxel size. The data from the University of Michigan was obtained using a 3 T Philips MR scanner and the parameters were: gradient directions = 15, b-values = 0/800 s/mm², resolution = 1.75 mm uniform voxel size. Eddy current correction was applied to each data set by using standard motion correction techniques.

Figure 2. Comparison results on Diffusion MRI classification.

Figure 3. SMATT (Archer et al., 2017) motor tract segmentation examples.

From each of these dMRIs, 12 regions of interest (ROIs), six in each hemisphere of the brain, in the sensorimotor tract are segmented by registering to the sensorimotor area tract template (SMATT) (Archer et al., 2017). These tracts are known to be affected by PD. Figure 3 depicts the M1, dorsal premotor cortex (PMd) and supplementary motor area (SMA) tracts. In our experiments, we adopt the most widely used representation of dMRI in the clinic, namely diffusion tensor images. Diffusion tensors (DTs) are symmetric positive-definite matrices (Basser et al., 1994).

Diffusion Tensor Representation

The DTI representation of diffusion weighted images assumes a local Gaussian distribution of water diffusion within each voxel (Basser et al., 1994). The covariance matrix of each local Gaussian represents the diffusion tensor, which is a symmetric positive definite (SPD) matrix. Thus we have a field f : U ⊂ Z^3 → P_3. We can equip the space P_3 with the GL(3)-invariant metric to make it a Riemannian homogeneous manifold.
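For concreteness, the sketch below writes out the Exp map, Log map, and geodesic distance on P_3 under the standard closed forms for the GL-invariant (affine-invariant) metric, and checks numerically that the distance is unchanged under the congruence action X ↦ A X Aᵀ with A ∈ GL(3). The helper names and the random test matrices are hypothetical, and this is an illustration rather than the authors' code.

```python
import numpy as np

def sym_fn(S, fn):
    """Apply a scalar function to a symmetric matrix through its eigenvalues."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * fn(vals)) @ vecs.T

def whiten(P, X):
    """Return P^{-1/2} X P^{-1/2}, symmetrized against round-off."""
    P_inv_half = sym_fn(P, lambda lam: 1.0 / np.sqrt(lam))
    M = P_inv_half @ X @ P_inv_half
    return 0.5 * (M + M.T)

def spd_log(P, Q):
    """Log map at P: P^{1/2} log(P^{-1/2} Q P^{-1/2}) P^{1/2}."""
    P_half = sym_fn(P, np.sqrt)
    return P_half @ sym_fn(whiten(P, Q), np.log) @ P_half

def spd_exp(P, V):
    """Exp map at P: P^{1/2} exp(P^{-1/2} V P^{-1/2}) P^{1/2}."""
    P_half = sym_fn(P, np.sqrt)
    return P_half @ sym_fn(whiten(P, V), np.exp) @ P_half

def spd_dist(P, Q):
    """Geodesic distance: Frobenius norm of log(P^{-1/2} Q P^{-1/2})."""
    return np.linalg.norm(sym_fn(whiten(P, Q), np.log))

def random_spd(rng):
    """Hypothetical test data: a well-conditioned random 3x3 SPD matrix."""
    B = rng.standard_normal((3, 3))
    return B @ B.T + np.eye(3)

rng = np.random.default_rng(3)
P, Q = random_spd(rng), random_spd(rng)
A = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)   # assumed invertible

d_before = spd_dist(P, Q)
d_after = spd_dist(A @ P @ A.T, A @ Q @ A.T)        # distance after the GL(3) action
Q_back = spd_exp(P, spd_log(P, Q))                  # Exp and Log are mutual inverses
print(d_before, d_after)
```

On P_n the Exp map at any point is a global diffeomorphism, so the Log map used by the MVC is defined for every pair of diffusion tensors.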

We estimate the diffusion tensor images from the segmented dMRIs of the sensorimotor tracts using the DiPy software (Garyfallidis et al., 2014). This data is fed directly into an MVC-net with five MVC + tReLU layers. The output from the last of these layers forms the input to an MVFC layer which maps this input into  Rn. Next, two standard fully connected layers are applied to this  Rn-valued input followed by a softmax function to output class probabilities. This architecture was found to give the best performance among similar architectures.

Classification Results

We compared the performance of MVC-Net with several deep net architectures including the ManifoldNet (Chakraborty et al., 2019), the ResNet-34 CNN architecture and a CapsuleNet architecture with dynamic routing. To perform the comparison, we applied each of the aforementioned deep net architectures to the above described diffusion tensor image data sets.

We train our MVC-net architecture for 200 epochs using cross-entropy loss and an Adam optimizer with learning rate set to 0.005. We obtain a 10-fold cross validation accuracy of 97.8%. For the ManifoldNet, we achieved a 10-fold cross validation accuracy of 94.8%. The ResNet-34 and CapsuleNet architectures are trained directly on the diffusion weighted images (without any diffusion tensor fitting to the dMRI data in the ROIs, since they cannot cope with symmetric positive definite matrix-valued images). With the ResNet-34 architecture we observe significant overfitting late in training and we utilize an early stopping approach to report the best 10-fold cross validation result, which still significantly under-performs the MVC-net and ManifoldNet (the only two approaches that respect the underlying geometry of P_3). Comprehensive results are reported in Table 2.

As is evident from Table 2, MVC-net outperforms all other methods on both training and test accuracy while simultaneously keeping the lowest parameter count. The inference speed is lower than that of ResNet-34 and CapsuleNet, but these architectures utilize operations that have been optimized heavily for inference speed for years. Further, in terms of the possible application domain of automated Parkinson’s diagnosis, the sub-second inference speeds we have achieved are more than sufficient in practice.

4.2. Anatomical Structure to Function Regression

In this experiment we consider the problem of learning a function from a structural image of the human brain to a functional physiological measurement. Specifically, we consider the problem of mapping from Cauchy Deformation Tensor (CDT) images, estimated relative to an atlas of diffusion MRI scans of the Substantia Nigra (Banerjee et al., 2016), a neuro-anatomical region known to be affected by movement disorders, to MDS-UPDRS scores. The MDS-sponsored Revision of the Unified Parkinson’s Disease Rating Scale (MDS-UPDRS) is a quantitative measure of PD severity assigned by a physician that combines various physical and psychological biomarkers associated with PD such as sleep quality, depression, and motor skills. The CDT of a diffusion MRI scan captures the deviation of a particular subject from a reference atlas (i.e. an “average” brain over the population); thus the CDT captures structural information about a particular brain.

Data Acquisition and Pre-processing

The data here consists of high angular resolution diffusion MRI (HARDI) (Tuch et al., 2002) images of 25 controls, 15 essential tremor (ET) patients and 26 PD patients acquired using the same parameters as the PDBP data in the previous experiment. For each patient we have corresponding MDS-UPDRS scores. We segment the Substantia Nigra (40 voxels large) from each of these images. Each image is pre-processed to estimate an Ensemble Average Propagator (EAP) at each voxel, leading to an EAP field representation of the dMRI data. The EAP P(x, r) is a probability distribution that describes the likelihood of water diffusing along a vector r (Johansen-Berg & Behrens, 2013). To compute the CDT we follow a standard procedure which we outline here. First, we non-rigidly register (Cheng et al., 2009) each of the EAP-field images to the Montreal Neurological Institute (MNI) reference atlas (Fonov et al., 2011). Let J be the Jacobian of the non-rigid registration; then the CDT at each voxel is given by √(JJᵀ). This gives a 3 × 3 SPD matrix at each voxel, hence for each sample we have a 40 × 3 × 3 dimensional tensor. Thus, to summarize, the independent variables are 40 × 3 × 3 sized CDT fields describing structural properties of a particular human brain, and the dependent variables are the vectors of MDS-UPDRS scores, quantifying functional severity of movement disorders.
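The per-voxel CDT computation √(JJᵀ) can be sketched as a symmetric matrix square root obtained from an eigendecomposition. The Jacobian field below is random placeholder data standing in for the output of a registration step, and the function name is hypothetical.

```python
import numpy as np

def cauchy_deformation_tensor(J):
    """Return the symmetric square root of J J^T for a 3x3 Jacobian J."""
    vals, vecs = np.linalg.eigh(J @ J.T)
    vals = np.clip(vals, 0.0, None)          # guard against tiny negative round-off
    return (vecs * np.sqrt(vals)) @ vecs.T

# Placeholder Jacobians for a 40-voxel region: small perturbations of the identity.
rng = np.random.default_rng(0)
jacobians = np.eye(3) + 0.1 * rng.standard_normal((40, 3, 3))

# One 3x3 SPD matrix per voxel, stacked into a 40 x 3 x 3 array.
cdt_field = np.stack([cauchy_deformation_tensor(J) for J in jacobians])
print(cdt_field.shape)
```

The result is SPD whenever J is invertible, which is what allows the CDT field to be treated as a P_3-valued image.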

We use an MVC-net architecture operating on the space P_3, where the CDT descriptors live. The architecture for this problem consists of 3 MVC + tReLU layers followed by an MVFC and two Euclidean fully connected layers plus a softmax layer. We compare the performance of the MVC-net to state-of-the-art methods for this task in Chakraborty et al. (2019) and Banerjee et al. (2016). The performance is quantified in terms of the R² statistic. Results are summarized in Table 4.

image

Figure 4. Structure to Function Regression R²-statistic.

As is evident from Table 4, MVC-net outperforms the competing methods on this particular task, although all methods perform well. Beyond this, MVC-net again achieves significant parameter efficiency, with ∼10K parameters for this architecture. Future work will focus on evaluating MVC-net on this task for much larger datasets.

4.3. Video Classification

We now outline an architecture for using MVC-net together with covariance blocks (Yu & Salzmann, 2017) to perform video classification. We present results of applying this MVC-net architecture to the Moving MNIST dataset, which is generated using the algorithm in Srivastava et al. (2015). Each video consists of two MNIST digits moving across the frame. The velocity of both digits is fixed across all videos in a class, but the digits themselves vary (in the range 0 to 9). Different classes have different angles of motion, and the goal is to classify them based on this angle.

Architecture for Video Classification

We now present an MVC-net architecture for video classification. Given an input video of dimensions F × 3 × H × W, a covariance block (Yu & Salzmann, 2017) is applied in parallel to each frame to yield an F × (C + 1) × (C + 1) tensor. An illustration of the architecture is shown in Figure 5. We will now describe the components of this architecture.

image

Figure 5. MVC-net Video Classification Architecture

image

Figure 6. Comparison results on Moving MNIST. All classification results are 10-fold cross validation test accuracy.

For completeness, we summarize the covariance block design from Yu & Salzmann (2017) below. The input to the covariance block is an image of size 3 × H × W. We first apply a regular CNN without fully connected layers at the end to get a C × H′ × W′ sized output. Now we interpret each channel as a feature vector and compute a C × C covariance matrix of the channel activations. Finally, to incorporate the first order statistics, we append the mean channel activation to both the last row and column of the covariance matrix to get a (C + 1) × (C + 1) shaped output.
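The last two steps of the covariance block (channel covariance plus appended mean) can be sketched as follows. The text above does not specify the corner entry or whether the covariance block is modified when the mean is appended; this sketch assumes the common Gaussian-embedding form, with Σ + μμᵀ in the top-left block and 1 in the corner, so that the output remains SPD. The feature map is random placeholder data standing in for CNN activations, and the function name and the small ridge term eps are hypothetical.

```python
import numpy as np

def covariance_descriptor(features, eps=1e-6):
    """Map a C x H' x W' feature map to a (C+1) x (C+1) SPD descriptor."""
    C = features.shape[0]
    X = features.reshape(C, -1)                  # each column is one spatial sample
    mu = X.mean(axis=1)                          # mean channel activation
    Xc = X - mu[:, None]
    cov = Xc @ Xc.T / X.shape[1] + eps * np.eye(C)   # ridge keeps cov positive definite

    out = np.empty((C + 1, C + 1))
    out[:C, :C] = cov + np.outer(mu, mu)         # assumed Gaussian-embedding block
    out[:C, C] = mu                              # mean appended as last column
    out[C, :C] = mu                              # mean appended as last row
    out[C, C] = 1.0                              # assumed corner entry
    return out

# Placeholder activations: C = 8 channels on a 5 x 5 spatial grid.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 5, 5))
desc = covariance_descriptor(features)
print(desc.shape)
```

With this form the Schur complement of the corner entry equals the covariance itself, so the descriptor is SPD whenever the covariance is.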

As mentioned before, applying a covariance block at each frame of a video in parallel yields an F × (C + 1) × (C + 1) shaped tensor, where at each frame we have a (C + 1) × (C + 1) covariance matrix, which is an element of the space P_{C+1}. We now use a one-dimensional temporal MVC-net architecture to map the per-frame covariance descriptors to class outputs. This is no different from traditional temporal CNNs, i.e. at each layer, a moving window slides over the frames and computes a weighted combination. For our architecture, we use the manifold-valued convolution defined earlier on this sequence of frames, each represented by a covariance matrix descriptor. Figure 5 depicts a schematic of the MVC-net tailored for the video classification problem.

Experimental Results: For this experiment we use five MVC + tReLU layers, followed by an MVFC layer and two Euclidean fully connected layers and a softmax. We use an Adam optimizer with learning rate set to 0.005 and train for 300 epochs using the cross-entropy loss. 10-fold cross validation results are summarized in Table 6. As evident, the MVC-net either outperforms or is competitive with all competing methods in terms of test accuracy.

In this paper, we presented a generalization of CNNs to manifold-valued images, i.e. images whose value sets lie in Riemannian manifolds. Such data are commonly encountered in many applications including but not limited to medical imaging and computer vision. We defined the analog of the traditional convolution operation for manifold-valued images and proved that it is equivariant to the isometry group actions admitted by the manifold. Equivariance is a fundamental design principle in traditional CNNs that affords weight sharing in deep neural networks. Further, we also proved that a multi-layer MVC-Net requires the use of nonlinear activation functions and proposed the tangent ReLU (tReLU) to this end. The final layer of the MVC-net is the manifold-valued fully connected layer, whose construction is adopted from Chakraborty et al. (2019). Finally, we presented several experiments demonstrating the performance of the MVC-Net on classification problems drawn from medical imaging and computer vision. Comparisons to the state-of-the-art were presented, demonstrating comparable to superior performance of the MVC-Net in terms of classification accuracy, parameter count, and time/epoch efficiency.

Afsari, B. Riemannian L^p center of mass: Existence, uniqueness, and convexity. Proceedings of the American Mathematical Society, 139(02):655–655, 2011. ISSN 0002-9939. doi: 10.1090/S0002-9939-2010-10541-5.

Archer, D., Vaillancourt, D., and Coombes, S. A template and probabilistic atlas of the human sensorimotor tracts using diffusion mri. Cerebral Cortex, 28:1–15, 03 2017. doi: 10.1093/cercor/bhx066.

Banerjee, M., Chakraborty, R., Ofori, E., Okun, M. S., Vaillancourt, D. E., and Vemuri, B. C. A nonlinear regression technique for manifold valued data with applications to medical image analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4424–4432, 2016.

Banerjee, M., Chakraborty, R., Archer, D., Vaillancourt, D., and Vemuri, B. C. Dmr-cnn: A cnn tailored for dmr scans with applications to pd classification. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 388–391. IEEE, 2019.

Basser, P. J., Mattiello, J., and LeBihan, D. Mr diffusion tensor spectroscopy and imaging. Biophysical journal, 66(1):259–267, 1994.

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

Brooks, D., Schwander, O., Barbaresco, F., Schneider, J.-Y., and Cord, M. Riemannian batch normalization for spd neural networks. arXiv preprint arXiv:1909.02414, 2019.

Chakraborty, R., Bouza, J., Manton, J., and Vemuri, B. C. A deep neural network for manifold-valued data with applications to neuroimaging. In International Conference on Information Processing in Medical Imaging, pp. 112–124. Springer, 2019.

Cheng, G., Vemuri, B. C., Carney, P. R., and Mareci, T. H. Non-rigid registration of high angular resolution diffusion images represented by gaussian mixture fields. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 190–197. Springer, 2009.

Cohen, T. S., Geiger, M., Köhler, J., and Welling, M. Spherical CNNs. arXiv preprint arXiv:1801.10130, 2018.

Esteves, C., Allen-Blanchette, C., Makadia, A., and Daniilidis, K. Learning SO(3) equivariant representations with spherical CNNs. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 52–68, 2018.

Fonov, V., Evans, A. C., Botteron, K., Almli, C. R., McKinstry, R. C., Collins, D. L., and Group, B. D. C. Unbiased average age-appropriate atlases for pediatric studies. Neuroimage, 54(1):313–327, 2011.

Garyfallidis, E., Brett, M., Amirbekian, B., Rokem, A., Van Der Walt, S., Descoteaux, M., and Nimmo-Smith, I. Dipy, a library for the analysis of diffusion mri data. Frontiers in Neuroinformatics, 8:8, 2014. ISSN 1662-5196. doi: 10.3389/fninf.2014.00008.

Groisser, D. Newton’s method, zeroes of vector fields, and the Riemannian center of mass. Advances in Applied Mathematics, 33(1):95–135, 2004. ISSN 01968858. doi: 10.1016/j.aam.2003.08.003.

Huang, Z. and Van Gool, L. J. A Riemannian Network for SPD Matrix Learning. In AAAI, volume 1, pp. 3, 2017.

Huang, Z., Wu, J., and Van Gool, L. Building deep networks on grassmann manifolds. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Johansen-Berg, H. and Behrens, T. E. Diffusion MRI: from quantitative measurement to in vivo neuroanatomy. Academic Press, 2013.

Kondor, R. and Trivedi, S. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.

Kondor, R., Lin, Z., and Trivedi, S. Clebsch–gordan nets: a fully fourier space spherical convolutional neural network. In Advances in Neural Information Processing Systems, pp. 10117–10126, 2018.

Lee, J. M. Riemannian manifolds: an introduction to curvature, volume 176. Springer Science & Business Media, 2006.

Mallat, S. Understanding Deep Convolutional Networks. Philosophical Transactions A, 374:20150203, 2016. ISSN 1364503X. doi: 10.1098/rsta.2015.0203.

Masci, J., Boscaini, D., Bronstein, M., and Vandergheynst, P. Geodesic convolutional neural networks on riemannian manifolds. In Proc. of the IEEE Intl. Conf. on Computer Vision workshops, pp. 37–45, 2015.

Fréchet, M. Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l’I. H. P., 10(4):215–310, 1948.

Poulenard, A. and Ovsjanikov, M. Multi-directional geodesic neural networks via equivariant convolution. In SIGGRAPH Asia 2018 Technical Papers, pp. 236. ACM, 2018.

Srivastava, N., Mansimov, E., and Salakhutdinov, R. Unsupervised learning of video representations using LSTMs. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pp. 843–852. JMLR.org, 2015.

Tuch, D. S., Reese, T. G., Wiegell, M. R., Makris, N., Belliveau, J. W., and Wedeen, V. J. High angular resolution diffusion imaging reveals intravoxel white matter fiber heterogeneity. Magnetic Resonance in Medicine, 48(4):577–582, 2002. doi: 10.1002/mrm.10268.

Yu, K. and Salzmann, M. Second-order Convolutional Neural Networks. ArXiv e-prints, March 2017.

