Stochastic variational inference (SVI) methods provide an appealing way of scaling probabilistic modeling to large scale data. These methods transform the problem of computing an intractable posterior distribution to finding the best approximation within a class of tractable probability distributions (Hoffman et al., 2013). Using tractable classes of approximate distributions, e.g., mean-field, and Bethe
1McGill University 2Mila 3University of Toronto 4Vector Institute. Correspondence to: Joey Bose <joey.bose@mail.mcgill.ca>, Ariella Smofsky <ariella.smofsky@mail.mcgill.ca>.
Proceedings of the International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).
Figure 1. The shortest path between a given pair of node embeddings in and hyperbolic space as modelled by the Lorentz model
and Poincar´e disk
. Unlike Euclidean space, dis- tances between points grow exponentially as you move away from the origin in hyperbolic space, and thus the shortest paths between points in hyperbolic space go through a common parent node (i.e., the origin), giving rise to hierarchical and tree-like structure.
approximations, facilitates efficient inference, at the cost of limiting the expressiveness of the learned posterior.
In recent years, the power of these SVI methods has been further improved by employing normalizing flows, which greatly increase the flexibility of the approximate posterior distribution. Normalizing flows involve learning a series of invertible transformations, which are used to transform a sample from a simple base distribution to a sample from a richer distribution (Rezende & Mohamed, 2015). Indeed, flow-based posteriors enjoy many advantages such as efficient sampling, exact likelihood estimation, and lowvariance gradient estimates when the base distribution is reparametrizable, making them ideal for modern machine learning problems. There have been numerous advances in normalizing flow construction in Euclidean spaces, such as RealNVP (Dinh et al., 2017), B-NAF (Huang et al., 2018; De Cao et al., 2019), and FFJORD (Grathwohl et al., 2018), to name a few.
However, current normalizing flows are restricted to Euclidean space, and as a result, these approaches are ill-equipped to model data with an underlying hierarchical structure. Many real-world datasets—such as ontologies, social networks, sentences in natural language, and evolutionary relationships between biological entities in phylogenetics—exhibit rich hierarchical or tree-like structure. Hierarchical data of this kind can be naturally represented in hyperbolic spaces, i.e., non-Euclidean spaces with constant negative curvature (Figure 1). But Euclidean normalizing flows fail to incorporate these structural inductive biases, since Euclidean space cannot embed deep hierarchies without suffering from high distortion (Sarkar, 2011). Furthermore, sampling from densities defined on Euclidean space will inevitability generate points that do not lie on the underlying hyperbolic space.
Present work. To address this fundamental limitation, we present the first extension of normalizing flows to hyperbolic spaces. Prior works have considered learning models with hyperbolic parameters (Liu et al., 2019b; Nickel & Kiela, 2018) as well as variational inference with hyperbolic latent variables (Nagano et al., 2019; Mathieu et al., 2019), but our work represents the first approach to allow flexible density estimation in hyperbolic space.
To define our normalizing flows we leverage the Lorentz model of hyperbolic geometry and introduce two new forms of coupling, Tangent Coupling (T C) and Wrapped Hyperboloid Coupling (WHC). These define flexible and invertible transformations capable of transforming sampled points in the hyperbolic space. We derive the change of volume associated with these transformations and show that it can be computed efficiently with O(n) cost, where n is the dimension of the hyperbolic space. We empirically validate our proposed normalizing flows on structured density estimation, reconstruction, and generation tasks on hierarchical data, highlighting the utility of our proposed approach.
Within the Riemannian geometry framework, hyperbolic spaces are manifolds with constant negative curvature K and are of particular interest for embedding hierarchical structures. There are multiple models of n-dimensional hyperbolic space, such as the hyperboloid , also known as the Lorentz model, or the Poincar´e ball
. Figure 1 illustrates some key properties of
and
, highlight- ing how distances grow exponentially as you move away from the origin and how the shortest paths between distant points tend to go through a common parent (i.e., the origin), giving rise to a hierarchical or tree-like structure. In the next section, we briefly review the Lorentz model of hyperbolic geometry. We are not assuming a background in Riemannian geometry, though Appendix A and Ratcliffe (1994) are of use to the interested reader. Henceforth, for notational clarity, we use boldface font to denote points on the hyperboloid manifold.
2.1. Lorentz Model of Hyperbolic Geometry
An n-dimensional hyperbolic space, , is the unique, com- plete, simply-connected n-dimensional Riemannian manifold of constant negative curvature, K. For our purposes, the Lorentz model is the most convenient representation of hyperbolic space, since it is equipped with relatively simple explicit formulas and useful numerical stability properties (Nickel & Kiela, 2018). We choose the 2D Poincar´e disk
to visualize hyperbolic space because of its conformal mapping to the unit disk. The Lorentz model embeds hyperbolic space
-dimensional Minkowski space, defined as the manifold
equipped with the following inner product:
which has the type . It is common to denote this space as
to emphasize the distinct role of the zeroth coordinate. In the Lorentz model, we model hyperbolic space as the (upper sheet of) the hyperboloid embedded in Minkowski space. It is a remarkable fact that though the Lorentzian metric (Eq. 1) is in-definite, the induced Riemannian metric
on the unit hyperboloid is positive definite (Ratcliffe, 1994). The n-Hyperbolic space with constant negative curvature K with origin o = (1/K, 0, . . . , 0), is a Riemannian manifold
Equipped with this, the induced distance between two points is given by
The tangent space to the hyperboloid at the point p can also be described as an embedded subspace of
. It is given by the set of points satisfying the orthogonality relation with respect to the Minkowski inner product,1
Of special interest are vectors in the tangent space at the origin of whose norm under the Minkowski inner product is equivalent to the conventional Euclidean norm. That is
is a vector such that
and
. Thus at the origin the partial derivatives with respect to the ambient coordinates,
, define the covariant derivative.
Projections. Starting from the extrinsic view by which we consider , we may project any vector
on to the hyperboloid using the shortest Euclidean distance:
Furthermore, by definition a point on the hyperboloid sat-isfies and thus when provided with n coordinates
we can always determine the missing coordinate to get a point on
Exponential Map. The exponential map takes a vector, v, in the tangent space of a point x to a point on the manifold—i.e., y
by moving a unit length along the geodesic,
(straightest parametric curve), uniquely defined by
with direction
v. The closed form expression for the exponential map is then given by
where we used the generalized radius in place of the curvature.
Logarithmic Map. As the inverse of the exponential map, the logarithmic map takes a point, y, on the manifold back to the tangent space of another point x also on the manifold. In the Lorentz model this is defined as
where
Parallel Transport. The parallel transport for two points xis a map that carries the vectors in
corresponding vectors at
along the geodesic. That is vectors are connected between the two tangent spaces such that the covariant derivative is unchanged. Parallel transport is a map that preserves the metric, i.e.,
and in the Lorentz model is given by
where is as defined above. Another useful property is that the inverse parallel transport simply carries the vectors back along the geodesic and is simply defined as
2.2. Probability Distributions on Hyperbolic Spaces
Probability distributions can be defined on Riemannian manifolds, which include as a special case. One transforms the infinitesimal volume element on the manifold to the corresponding volume element in
as de-fined by the co-ordinate charts. In particular, given the Riemannian manifold M(z) and its metric
, we have
is the Lebesgue measure. We now briefly survey three distinct generalizations of the normal distribution to Riemannian manifolds.
Riemannian Normal. The first is the Riemannian normal (Pennec, 2006; Said et al., 2014), which is derived from maximizing the entropy given a Fr´echet mean and a dispersion parameter
. Specifically, we have
, where
is the induced distance and Z is the normalization constant (Said et al., 2014; Mathieu et al., 2019).
Restricted Normal. One can also restrict sampled points from the normal distribution in the ambient space to the manifold. One example is the Von Mises distribution on the unit circle and its generalized version, i.e., Von Mises-Fisher distribution on the hypersphere (Davidson et al., 2018).
Wrapped Normal. Finally, we can define a wrapped normal distribution (Falorsi et al., 2019; Nagano et al., 2019), which is obtained by (1) sampling from N(0, I) and then transforming it to a point by concatenating 0 as the zeroth coordinate; (2) parallel transporting the sample v from the tangent space at o to the tangent space of another point
on the manifold to obtain u; (3) mapping u from the tangent space to the manifold using the exponential map at
. Sampling from such a distribution is straightforward and the probability density can be obtained via the change of variable formula,
where p(z) is the wrapped normal distribution and p(v) is the normal distribution in the tangent space of o.
We seek to define flexible and learnable distributions on , which will allow us to learn rich approximate posterior distributions for hierarchical data. To do so, we design a class of invertible parametric hyperbolic functions,
. A sample from the approximate posterior can then be obtained by first sampling from a simple base distribution
defined on
and then applying a composition of functions
from this class:
In order to ensure effective and tractable learning, the class of functions must satisfy three key desiderata:
1. Each function must be invertible.
2. We must be able to efficiently sample from the final distribution,
3. We must be able to efficiently compute the associated change in volume (i.e., the Jacobian determinant) of the overall transformation.
Given these requirements, the final transformed distribution is given by the change of variables formula:
Functions satisfying desiderata 1-3 in Euclidean space are often termed normalizing flows (Appendix B), and our work extends this idea to hyperbolic spaces. In the following sections, we describe two flows of increasing complexity: Tangent Coupling (T C) and Wrapped Hyperboloid Coupling (WHC). The first approach lifts a standard Euclidean flow to the tangent space at the origin of the hyperboloid. The second approach modifies the flow to explicitly utilize hyperbolic geometry. Figure 2 illustrates synthetic densities as learned by our approach on
3.1. Tangent Coupling
Similar to the Wrapped Normal distribution (Section 2.2), one strategy to define a normalizing flow on the hyperboloid is to use the tangent space at the origin. That is, we first sample a point from our base distribution—which we define to be a Wrapped Normal—and use the logarithmic map at the origin to transport it to the corresponding tangent space. Once we arrive at the tangent space we are free to apply any Euclidean flow before finally projecting back to the manifold using the exponential map. This approach leverages the fact that the tangent bundle of a hyperbolic manifold has a well-defined vector space structure, allowing affine transformations and other operations that are ill-defined on the manifold itself.
Following this idea, we build upon one of the earliest and most well-studied flows: the RealNVP flow (Dinh et al., 2017). At its core, the RealNVP flow uses a computationally symmetric transformation (affine coupling layer), which has the benefit of being fast to evaluate and invert due to its lower triangular Jacobian, whose determinant is cheap to compute. Operationally, the coupling layer is implemented using a binary mask, and partitions some input into two sets, where the first set,
, is transformed elementwise independently of other dimensions. The second set,
, is also transformed elementwise but in a way that depends on the first set (see Appendix B.2 for more details). Since all coupling layer operations occur at
this form of coupling as Tangent Coupling (T C).
Thus, the overall transformation due to one layer of our T C
Figure 2. Comparison of density estimation in hyperbolic space for 2D wrapped Gaussian (WG) and mixture of wrapped gaussian (MWG) on . Densities are visualized in the Poincar´e disk. Additional qualitative results can be found in Appendix F.
flow is a composition of a logarithmic map, affine coupling defined on , and an exponential map:
where is a point on
is a pointwise non-linearity such as the exponential function. Functions s and t are parameterized scale and translation functions implemented as neural nets from
. One important detail is that arbitrary operations on a tangent vector
may transport the resultant vector outside the tangent space, hampering subsequent operations. To avoid this we can keep the first dimension fixed at
to ensure we remain in
Similar to the Euclidean RealNVP, we need an efficient expression for the Jacobian determinant of
Proposition 1. The Jacobian determinant of a single T C layer in equation 11 is:
where, is as defined above.
Proof Sketch. Here we only provide a sketch of the proof and details can be found in Appendix C. First, observe that the overall transformation is a valid composition of functions: . Thus, the overall determinant can be computed by chain rule and the identity, det
. Tackling each function in the composition individually, det
as derived in Skopek et al. (2019). As the logarithmic map is the inverse of the exponential map the Jacobian determinant is simply the inverse of the determinant of the exponential map, which gives the det
term. For the middle term, we must calculate the directional derivative of
in an orthonormal basis w.r.t. the Lorentz inner product, of
. Since the standard Euclidean basis vectors
are also a basis for
, the Jacobian determinant det
to that of the RealNVP flow, which is lower triangluar and is thus efficiently computable in O(n) time.
It is remarkable that the middle term in Proposition 1 is precisely the same change in volume associated with affine coupling in RealNVP. The change in volume due to the hyperbolic space only manifests itself through the exponential and logarithmic maps, each of which can be computed in O(n) cost. Thus, the overall cost is only slightly larger than the regular Euclidean RealNVP, but still O(n).
3.2. Wrapped Hyperboloid Coupling
Figure 3. Wrapped Hyperbolic Coupling. The left figure depicts a partitioned input point parallel transport. The right figure depicts the
vector after it is transformed, parallel transported, and projected to
The hyperbolic normalizing flow with T C layers discussed above operates purely in the tangent space at the origin. This simplifies the computation of the Jacobian determinant, but anchoring the flow at the origin may hinder its expressive power and its ability to leverage disparate regions of the manifold. In this section, we remedy this shortcoming with a new hyperbolic flow that performs translations between tangent spaces via parallel transport.
We term this transformation Wrapped Hyperboloid Coupling (WHC). As with the T C layer, it is a fully invertible transformation with a tractable ana- lytic form for the Jacobian determinant. To define a WHC layer we first use the logarithmic map at the origin to trans- port a point to the tangent space. We employ the coupling strategy previously discussed and partition our input vector into two components:
and
. Let
be the point on
after the logarithmic map. The remainder of the WHC layer can be defined as follows;
Functions and
are taken to be arbitrary neural nets, but the role of t when compared to T C is vastly different. In particular, the generalization of translation on Riemannian manifolds can be viewed as parallel transport to a different tangent space. Consequently, in Eq. 13, the function t predicts a point on the manifold that we wish to parallel transport to. This greatly increases the flexibility as we are no longer confined to the tangent space at the origin. The logarithmic map is then used to ensure that both
and
are in the same tangent space before the final exponential map that projects the point to the manifold.
One important consideration in the construction of t is that it should only parallel transport functions of . However, the output of t is a point on
and without care this can involve elements in
. To prevent such a scenario we construct the output of
where elements
are used to determine the value of
using Eq. 5, such that it is a point on the manifold and every remaining index is set to zero. Such a construction ensures that only components of any function of
are parallel transported as desired. Figure 3 illustrates the transformation performed by the WHC layer.
Inverse of WHC. To invert the flow it is sufficient to show that argument to the final exponential map at the origin itself is invertible. Furthermore, note that undergoes an identity mapping and is trivially invertible. Thus, we need to show that the second partition is invertible, i.e. that the following transformation is invertible:
As discussed in Section 2, the parallel transport, exponential map, and logarithmic map all have well-defined inverses with closed forms. Thus, the overall transformation is invertible in closed form:
Properties of WHC. To compute the Jacobian determinant of the full transformation in Eq. 13 we proceed by analyzing the effect of WHC on valid orthonormal bases w.r.t. the Lorentz inner product for the tangent space at the origin. We state our main result here and provide a sketch of the proof, while the entire proof can be found in Appendix D.
Proposition 2. The Jacobian determinant of the function in equation 13 is:
where , the constant
is a non-linearity,
Proof Sketch. We first note that the exponential and logarithmic maps applied at the beginning and end of the WHC can be dealt with by appealing to the chain rule and the known Jacobian determinants for these functions as used in Proposition 1. Thus, what remains is the following term: . To evaluate this term we rely on the following Lemma.
Lemma 1. Let be a function defined as:
Now, define a function on the subspace of
corresponding to the standard basis elements
where denotes the portion of the vector
corresponding to the standard basis elements
and s and t are constants (which depend on
). In Equation equation 17, we use
to denote the vector corresponding to only the dimensions d + 1, ..., n and similarly for
we have that
The proof for Lemma 1 is provided in Appendix D. Using Lemma 1, and the fact that (Nagano et al., 2019) we are left with another composition of functions but on the subspace
. The Jacobian determinant for these functions, are simply that of the logarithmic map, exponential map and the argument to the parallel transport which can be easily computed as
The cost of computing the change in volume for one WHC layer is O(n) which is the same as a T C layer plus the added cost of the two new maps that operate on the lower subspace of basis elements.
We evaluate our T C-flow and WHC-flow on three tasks: structured density estimation, graph reconstruction, and graph generation.2 Throughout our experiments, we rely on three main baselines. In Euclidean space, we use Gaussian latent variables and affine coupling flows (Dinh et al., 2017), denoted N and NC, respectively. In the Lorentz model, we use Wrapped Normal latent variables, H-VAE, as an analogous baseline (Nagano et al., 2019). Since all model parameters are defined on Euclidean tangent spaces, models can be trained with conventional optimizers like Adam (Kingma & Ba, 2014). Following previous work, we also consider the curvature K as a learnable parameter with a warmup of 10 epochs, and we clamp the max norm of vectors to 40 before any logarithmic or exponential map (Skopek et al., 2019). Appendix E contains details on model architectures and implementation details.
4.1. Structured Density Estimation
We first consider structured density estimation in a canonical VAE setting (Kingma & Welling, 2013), where we seek to learn rich approximate posteriors using normalizing flows and evaluate the marginal log-likelihood of test data. Following work on hyperbolic VAEs, we test the approaches on a branching diffusion process (BDP) and dynamically binarized MNIST (Mathieu et al., 2019; Skopek et al., 2019).
To estimate the log likelihood we perform importance sampling using 500 samples from the test set (Burda et al., 2015). Our results are shown in Tables 1 and 2. On both datasets we observe our hyperbolic flows provide improvements when using latent spaces of low dimension. This result matches theoretical expectations—e.g., that trees can be perfectly embedded in —and dovetails with previous work on graph embedding (Nickel & Kiela, 2017), thus highlighting the benefit of leveraging hyperbolic space is most prominent in small dimensions. However, as we increase the latent dimension, the Euclidean approaches can compensate for this intrinsic geometric limitation. In the case of BDP we note that the data is indeed a noisy binary tree, which theoretically can be represented in a 2-D hyperbolic space and thus moving to higher dimensional latent space is not beneficial.
Table 1. Test Log Likelihood on Binary Diffusion Process versus latent dimension. All normalizing flows use 2-coupling layers.
Model MNIST 2 MNIST 4 MNIST 6
Table 2. Test Log Likelihood on MNIST averaged over 5 runs verus latent dimension. * indicates numerically unstable settings.
4.2. Graph Reconstruction
We evaluate the utility of our hyperbolic flows by conducting experiments on the task of link prediction using graph neural networks (GNNs) (Scarselli et al., 2008) as an inference model. Given a simple graph G = (V, A, X), defined by a set of nodes V, an adjacency matrix node feature matrix
, we learn a VGAE (Kipf & Welling, 2016) model whose inference network,
a distribution over node embeddings
the likelihood of an edge existing between pairs of nodes we use an inner product decoder:
, with dot products computed in
when nec- essary. Given these components, the inference GNNs are trained to maximize the variational lower bound on a training set of edges.
We use two different disease datasets taken from (Chami et al., 2019) and (Mathieu et al., 2019)3 for evaluation purposes. Our chosen datasets reflect important real world use cases where the data is known to contain hierarchies. One such measure to determine how tree-like a given graph is known to be Gromovs -hyperbolicity and traditional link prediction datasets such as Cora and Pubmed (Yang et al., 2016) were found to lack such a property and are not suitable candidates to evaluate our proposed approach (Chami et al., 2019). The first dataset Diseases-I is composed of a network of disorders and disease genes linked by the known disordergene associations (Goh et al., 2007). In the second dataset Diseases-II, we build tree networks of a SIR disease spreading model (Anderson et al., 1992), where node features determine the susceptibility to the disease. In Table 3 we report the AUC and average precision (AP) on the test set. We observe consistent improvements when using
Model Dis-I AUC Dis-I AP Dis-II AUC Dis-II AP
Table 3. Test AUC and Test AP on Graph Embeddings where Dis-I has latent dimesion 6 and Dis-II has latent dimension 2.
4.3. Graph Generation
Finally, we explore the utility of our hyperbolic flows for generating hierarchical structures. As a synthetic testbed, we construct datasets containing uniformly random trees as well as uniformly random lobster graphs (Golomb, 1996), where each graph contains between 20 to 100 nodes. Unlike prior work on graph generation—i.e., (Liu et al., 2019a)— our datasets are designed to have explicit hierarchies, thus enabling us to test the utility of hyperbolic generative models. We then train a generative model to learn the distribution of these graphs. We expect the hyperbolic flows to provide a significant benefit for generating valid random trees, as well as learning the distribution of lobster graphs, which are a special subset of trees.
We follow the two-stage training procedure outlined in Graph Normalizing Flows (Liu et al., 2019a) in that we first train an autoencoder to give node-level latents on which we train an normalizing flow for density estimation. Empirically, we find that using GRevNets (Liu et al., 2019a) and defining edge probabilities using a distance-based decoder consistently leads to better generation performance. Thus, we define edge probabilities as are learned edge spe-cific bias and temperature parameters. At inference time, we first sample the number of nodes to generate from the empirical distribution of the dataset. We then independently sample node latents from our prior, beginning with a fully connected graph, and then push these samples through our learned flow to give refined edge probabilities.
To evaluate the various approaches, we construct 100 training graphs for each dataset to train our model. Figure 4 shows representative samples generated by the various approaches. We see that hyperbolic normalizing flows learn to generate tree-like graphs and also match the specific properties of the lobster graph distribution, whereas the Euclidean flow model tends to generate densely connected graphs with many cycles (or else disconnected graphs). To quantify these intuitions, Table 4 contains statistics on how often
Figure 4. Selected qualitative results on graph generation for lobster and random tree graph.
Table 4. Generation statistics on random trees over 5 runs.
Figure 5. MMD scores for graph generation on Lobster graphs. Note, that NC achieves 0% accuracy.
the different models generate valid trees (denoted by “accuracy”), as well as the average number of triangles and the average global clustering coefficients for the generated graphs. Since the target data is random trees, a perfect model would achieve 100% accuracy, with no triangles, and a global clustering of 0 for all graphs. As a representative Euclidean baseline we employ Graph Normalizing Flows (GNFs) which is denoted as NC in Table 4 and Figure 5. We see that the hyperbolic models generate valid trees more often, and they generate graphs with fewer triangles and lower clustering on average. Finally, to evaluate how well the models match the specific properties of the lobster graphs, we follow Liao et al. (2019) and report the MMD distance between the generated graphs and a test set for various graph statistics (Figure 5). Again, we see that the hyperbolic approaches significantly outperform the Euclidean normalizing flow.
Hyperbolic Geometry in Machine Learning:. The intersection of hyperbolic geometry and machine learning has recently risen to prominence (Dhingra et al., 2018; Tay et al., 2018; Law et al., 2019; Khrulkov et al., 2019; Ovinnikov, 2019). Early prior work proposed to embed data into the Poincar´e ball model (Nickel & Kiela, 2017; Chamberlain et al., 2017). The equivalent Lorentz model was later shown to have better numerical stability properties (Nickel & Kiela, 2018), and recent work has leveraged even more stable tiling approaches (Yu & De Sa, 2019). In addition, there exists a burgeoning literature of hyperbolic counterparts to conventional deep learning modules on Euclidean spaces (e.g., matrix multiplication), enabling the construction of hyperbolic neural networks (HNNs) (Gulcehre et al., 2018; Ganea et al., 2018) with further extensions to graph data using hyperbolic GNN architectures (Liu et al., 2019a; Chami et al., 2019). Latent variable models on hyperbolic space have also been investigated in the context of VAEs, using generalizations of the normal distribution (Nagano et al., 2019; Mathieu et al., 2019). In contrast, our work learns a flexible approximate posterior using a novel normalizing flow designed to use the geometric structure of hyperbolic spaces. In addition to work on hyperbolic VAEs, there are also several works that explore other non-Euclidean spaces (e.g., spherical VAEs) (Davidson et al., 2018; Falorsi et al., 2019; Grattarola et al., 2019).
Learning Implicit Distributions. In contrast with exact likelihood methods there is growing interest in learning implicit distributions for generative modelling. Popular approaches include density ratio estimation methods using a parametric classifiers such as GANS (Goodfellow et al., 2014), and kernel based estimators (Shi et al., 2017). In the context of autoencoders learning implicit latent distribution can be seen as an adversarial game minimizing a specific divergence (Makhzani et al., 2015) or distance (Tolstikhin et al., 2017). Instead of adversarial formulations implicit distributions may also be learned directly by estimating the gradients of log density function using the Stein gradient estimator (Li & Turner, 2017). Finally, such gradient estimators can also be used to power variational inference with implicit posteriors enabling the use of posterior families with intractable densities (Shi et al., 2018).
Normalizing Flows:. Normalizing flows (NFs) (Rezende & Mohamed, 2015; Dinh et al., 2017) are a class of probabilistic models which use invertible transformations to map samples from a simple base distribution to samples from a more complex learned distribution. While there are many classes of normalizing flows (Papamakarios et al., 2019; Kobyzev et al., 2019), our work largely follows normalizing flows designed with partially-ordered dependencies, as found in affine coupling transformations (Dinh et al., 2017). Recently, normalizing flows have also been extended to Riemannian manifolds, such as spherical spaces in Gemici et al. (2016). In parallel to this work, normalizing flows have been extended to toriodal spaces (Rezende et al., 2020) and the data manifold (Brehmer & Cranmer, 2020). Finally, relying on affine coupling and GNNs, Liu et al. (2019a) develop graph normalizing flows (GNFs) for generating graphs. However, unlike our approach GNFs do not benefit from the rich geometry of hyperbolic spaces.
In this paper, we introduce two novel normalizing flows on hyperbolic spaces. We show that our flows are efficient to sample from, easy to invert and require only O(n) cost to compute the change in volume. We demonstrate the effectiveness of constructing hyperbolic normalizing flows for latent variable modeling of hierarchical data. We empirically observe improvements in structured density estimation, graph reconstruction and also generative modeling of treestructured data, with large qualitative improvements in generated sample quality compared to Euclidean methods. One important limitation is in the numerical error introduced by clamping operations which prevent the creation of deep flow architectures. We hypothesize that this is an inherent limitation of the Lorentz model, which may be alleviated with newer models of hyperbolic geometry that use integer-based tiling (Yu & De Sa, 2019). In addition, while we considered hyperbolic generalizations of the coupling transforms to define our normalizing flows, designing new classes of invertible transformations like autoregressive and residual flows on non-Euclidean spaces is an interesting direction for future work.
Funding: AJB is supported by an IVADO Excellence Fellowship. RL was supported by Connaught International Scholarship and RBC Fellowship. WLH is supported by a Canada CIFAR AI Chair. This work was also supported by NSERC Discovery Grants held by WLH and PP. In addition the authors would like to thank Chinwei Huang, Maxime Wabartha, Andre Cianflone and Andrea Madotto for helpful feedback on earlier drafts of this work and Kevin Luk, Laurent Dinh and Niky Kamran for helpful technical discussions. The authors would also like to thank the anonymous reviewers for their comments and feedback and Aaron Lou, Derek Lim, and Leo Huang for catching a bug in the code.
Anderson, R. M., Anderson, B., and May, R. M. Infectious diseases of humans: dynamics and control. Oxford university press, 1992.
Brehmer, J. and Cranmer, K. Flows for simultaneous man- ifold learning and density estimation. arXiv preprint arXiv:2003.13913, 2020.
Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
Chamberlain, B. P., Clough, J., and Deisenroth, M. P. Neural embeddings of graphs in hyperbolic space. arXiv preprint arXiv:1705.10359, 2017.
Chami, I., Ying, Z., R´e, C., and Leskovec, J. Hyperbolic graph convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 4869–4880, 2019.
Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. Hyperspherical variational auto-encoders. arXiv preprint arXiv:1804.00891, 2018.
De Cao, N., Titov, I., and Aziz, W. Block neural autoregres- sive flow. arXiv preprint arXiv:1904.04676, 2019.
Dhingra, B., Shallue, C. J., Norouzi, M., Dai, A. M., and Dahl, G. E. Embedding text in hyperbolic spaces. arXiv preprint arXiv:1806.04313, 2018.
Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estima- tion using real nvp. In The 5th International Conference on Learning Representations (ICLR), Vancouver, 2017.
Falorsi, L., de Haan, P., Davidson, T. R., and Forr´e, P. Reparameterizing distributions on lie groups. arXiv preprint arXiv:1903.02958, 2019.
Ganea, O., B´ecigneul, G., and Hofmann, T. Hyperbolic neural networks. In Advances in neural information processing systems, pp. 5345–5355, 2018.
Gemici, M. C., Rezende, D., and Mohamed, S. Normal- izing flows on riemannian manifolds. arXiv preprint arXiv:1611.02304, 2016.
Goh, K.-I., Cusick, M. E., Valle, D., Childs, B., Vidal, M., and Barab´asi, A.-L. The human disease network. Proceedings of the National Academy of Sciences, 104 (21):8685–8690, 2007.
Golomb, S. W. Polyominoes: puzzles, patterns, problems, and packings, volume 16. Princeton University Press, 1996.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. Ffjord: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
Grattarola, D., Livi, L., and Alippi, C. Adversarial autoen- coders with constant-curvature latent manifolds. Applied Soft Computing, 81:105511, 2019.
Gulcehre, C., Denil, M., Malinowski, M., Razavi, A., Pas- canu, R., Hermann, K. M., Battaglia, P., Bapst, V., Raposo, D., Santoro, A., et al. Hyperbolic attention networks. arXiv preprint arXiv:1805.09786, 2018.
Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. Neural autoregressive flows. In Proceedings of the 35th international conference on Machine learning, 2018.
Khrulkov, V., Mirvakhabova, L., Ustinova, E., Oseledets, I., and Lempitsky, V. Hyperbolic image embeddings. arXiv preprint arXiv:1904.02239, 2019.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
Kipf, T. N. and Welling, M. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
Kobyzev, I., Prince, S., and Brubaker, M. A. Normalizing flows: Introduction and ideas. arXiv preprint arXiv:1908.09257, 2019.
Law, M., Liao, R., Snell, J., and Zemel, R. Lorentzian dis- tance learning for hyperbolic representations. In International Conference on Machine Learning, pp. 3672–3681, 2019.
Li, Y. and Turner, R. E. Gradient estimators for implicit models. arXiv preprint arXiv:1705.07107, 2017.
Liao, R., Li, Y., Song, Y., Wang, S., Hamilton, W., Duve- naud, D. K., Urtasun, R., and Zemel, R. Efficient graph generation with graph recurrent attention networks. In Advances in Neural Information Processing Systems, pp. 4257–4267, 2019.
Liu, J., Kumar, A., Ba, J., Kiros, J., and Swersky, K. Graph normalizing flows. In Advances in Neural Information Processing Systems, pp. 13556–13566, 2019a.
Liu, Q., Nickel, M., and Kiela, D. Hyperbolic graph neural networks. In Advances in Neural Information Processing Systems, pp. 8228–8239, 2019b.
Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
Mathieu, E., Le Lan, C., Maddison, C. J., Tomioka, R., and Teh, Y. W. Continuous hierarchical representations with poincar´e variational auto-encoders. In Advances in neural information processing systems, pp. 12544–12555, 2019.
Nagano, Y., Yamaguchi, S., Fujita, Y., and Koyama, M. A wrapped normal distribution on hyperbolic space for gradient-based learning. In International Conference on Machine Learning, pp. 4693–4702, 2019.
Nickel, M. and Kiela, D. Poincar´e embeddings for learning hierarchical representations. In Advances in neural information processing systems, pp. 6338–6347, 2017.
Nickel, M. and Kiela, D. Learning continuous hierarchies in the lorentz model of hyperbolic geometry. arXiv preprint arXiv:1806.03417, 2018.
Ovinnikov, I. Poincar\’e wasserstein autoencoder. arXiv preprint arXiv:1901.01427, 2019.
Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.
Pennec, X. Intrinsic statistics on riemannian manifolds: Ba- sic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127, 2006.
Ratcliffe, J. G. Foundations of Hyperbolic Manifolds. Number 149 in Graduate Texts in Mathematics. SpringerVerlag, 1994.
Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd international conference on Machine learning. ACM, 2015.
Rezende, D. J., Papamakarios, G., Racani`ere, S., Albergo, M. S., Kanwar, G., Shanahan, P. E., and Cranmer, K. Normalizing flows on tori and spheres. arXiv preprint arXiv:2002.02428, 2020.
Said, S., Bombrun, L., and Berthoumieu, Y. New rieman- nian priors on the univariate normal model. Entropy, 16 (7):4015–4031, 2014.
Sarkar, R. Low distortion delaunay embedding of trees in hyperbolic plane. In International Symposium on Graph Drawing, pp. 355–366. Springer, 2011.
Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
Shi, J., Sun, S., and Zhu, J. Kernel implicit variational inference. arXiv preprint arXiv:1705.10119, 2017.
Shi, J., Sun, S., and Zhu, J. A spectral approach to gradi- ent estimation for implicit distributions. arXiv preprint arXiv:1806.02925, 2018.
Skopek, O., Ganea, O.-E., and B´ecigneul, G. Mixedcurvature variational autoencoders. arXiv preprint arXiv:1911.08411, 2019.
Tay, Y., Tuan, L. A., and Hui, S. C. Hyperbolic repre- sentation learning for fast and efficient neural question answering. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 583–591, 2018.
Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
Veliˇckovi´c, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
Xu, B., Wang, N., Chen, T., and Li, M. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
Yang, Z., Cohen, W. W., and Salakhutdinov, R. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.
Yu, T. and De Sa, C. M. Numerically accurate hyperbolic embeddings using tiling-based models. In Advances in Neural Information Processing Systems, pp. 2021–2031, 2019.
An n-dimensional manifold is a topological space that is equipped with a family of open sets which cover the space and a family of functions
that are homeomorphisms between the
and open subsets of R. The pairs
are called charts. A crucial requirement is that if two open sets
and
intersect in a region, call it
, then the composite map
restricted to
is infinitely differentiable. If
dimensional manifold then a chart,
maps an open subset U to an open subset
. Furthermore, the image of the point
the local coordinates of p on the chart
. Examples of manifolds include
, the Hypersphere
, the Hyperboloid
torus. In this paper we take an extrinsic view of the geometry, that is to say a manifold can be thought of as being embedded in a higher dimensional Euclidean space, —i.e.
, and inherits the coordinate system of the ambient space. This is not how the subject is usually developed but for spaces of constant curvature one gets convenient formulas.
Tangent Spaces. Let be a point on an
dimensional smooth manifold and let
be a differentiable parametric curve with parameter
passing through the point such that
. Since M is a smooth manifold we can trace the curve in local coordinates via a chart
and the entire curve is given in local coordinates by
The tangent vector to this curve at p is then simply
. Another interpretation of the tangent vector of
interpreting the point p as a position vector and the tangent vector is then interpreted as the velocity vector at that point. Using this definition the set of all tangent vectors at p is denoted as
, and is called the tangent space at p.
Riemannian Manifold. A Riemannian metric tensor g on a smooth manifold M is defined as a family of inner products such that at each point the inner product takes vectors from the tangent space at
This means g is defined for every point on M and varies smoothly. Locally, g can be defined using the basis vectors of the tangent space
. In matrix form the Riemannian metric, G(p), can be expressed as,
. A smooth manifold manifold M which is equipped with a Riemannian metric at every point
is called a Riemannian manifold. Thus every Riemannian manifold is specified as the tuple (M, g) which define the smooth manifold and its associated Riemannian metric tensor.
Armed with a Riemannian manifold we can now recover some conventional geometric insights such as the length of a parametric curve , the distance between two points on the manifold, local notion of angle, surface area and volume. We define the length of a curve,
. This definition is very similar to the length of a curve on Euclidean spaces if we just observe that the Riemannian metric is
. Now turning to the distance between points p and q we can reason that it must be the smallest or distance minimizing parametric curve between the points which in the literature are known as geodesics4. Stated another way:
. A norm is induced on every tangent space by
and is defined as
. Finally, we can also define an infitisimal volume element on each tangent space and as a result measure
being the Lebesgue measure.
Given a parametrized density on normalizing flow defines a sequence of invertible transformations to a more complex density over the same space via the change of variable formula for probability distributions (Rezende & Mohamed, 2015). Starting from a sample from a base distribution,
, a mapping
, with parameters
that is both invertible and smooth, the log density of
is defined as
. Where,
is the probability of the transformed sample and
is the Jacobian of f. To construct arbitrarily complex densities a chain of functions of the same form as f can be defined and through successive application of change of density for each invertible transformation in the flow. Thus the final sample from a flow is then given by
and it’s corresponding density can be determined simply by
. Of practical importance when designing normalizing flows is the cost associated with computed the log determinant of the Jacobian which is computationally expensive and can range anywhere from
for an arbitrary matrix and a chosen algorithm. However, through an appropriate choice of f this computation cost can be brought down significantly. While there are many different choices for the transformation function, f, in this work we consider only RealNVP based flows as presented in (Dinh et al., 2017) and (Rezende & Mohamed, 2015) due to their simplicity and expressive power in capturing complex data distributions.
B.1. Variational Inference with Normalizing Flows
One obvious use case for Normalizing Flows is in learning a more expressive often multi-modal posterior distribution needed in Variational Inference. Recall that a variational approximation is a lower bound to the data log-likelihood. Take for example amortized variational inference in a VAE like setting whereby the posterior is parameterized and is amenable to gradient based optimization. The overall objective with both encoder and decoder networks:
The tightness of the Evidence Lower Bound (ELBO) also known as the negative free energy of the system, , is determined by the quality of the posterior approximation to the true posterior. Thus, one way to enrich the posterior approximation is by letting
be a normalizing flow itself and the resultant latent code be the output of the transformation. If we denote
the probability of the latent code
under the base distribution and
as the latent code after K flow layers we may rewrite the Free Energy as follows:
For convenience we may take which is a reparametrized gaussian density and p(z) = N(0, I) a standard normal.
B.2. Euclidean RealNVP
Computing the Jacobian of functions with high-dimensional domain and codomain and computing the determinants of large matrices are in general computationally very expensive. Further complications can arise with the restriction to bijective functions make for difficult modelling of arbitrary distributions. A simple way to significantly reduce the computational burden is to design transformations such that the Jacobian matrix is triangular resulting in a determinant which is simply the product of the diagonal elements. In (Dinh et al., 2017), real valued non-volume preserving (RealNVP) transformations are introduced as simple bijections that can be stacked but yet retain the property of having the composition of transformations having a triangular determinant. To achieve this each bijection updates a part of the input vector using a function that is simple to invert, but which depends on the remainder of the input vector in a complex way. Such transformations are denoted as affine coupling layers. Formally, given a D dimensional input x and d < D, the output y of an affine coupling layer follows the equations:
Where, s and t are parameterized scale and translation functions. As the second part of the input depends on the first, it is easy to see that the Jacobian given by this transformation is lower triangular. Similarly, the inverse of this transformation is given by:
Note that the form of the inverse does not depend on calculating the inverses of either s or t allowing them to be complex functions themselves. Further note that with this simple bijection part of the input vector is never touched which can limit the expressiveness of the model. A simple remedy to this is to simply reverse the elements that undergo scale and translation transformations prior to the next coupling layer. Such an alternating pattern ensures that each dimension of the input vector depends in a complex way given a stack of couplings allowing for more expressive models. Finally, the Jacobian of this transformation is a lower triangular matrix,
We now derive the change in volume formula associated with one T C layer. Without loss of generality we first define a binary mask which we use to partition the elements of a vector at into two sets. Thus b is defined as
Note that all T C layer operations exclude the first dimension which is always copied over by setting and ensures that the resulting sample always remains on
. Utilizing b we may rewrite Equation 11 as,
where is a point on the tangent space at o. Similar to the Euclidean RealNVP, we wish to calculate the jacobian determinant of this overall transformation. We do so by first observing that the overall transformation is a valid composition of functions:
is the flow in tangent space. Utilizing the chain rule and the identity that the determinant of a product is the product of the determinants of its constituents we may decompose the jacobian determinant as,
Tackling each term on RHS of Eq. 32 individually, det
2019). As the logarithmic map is the inverse of the exponential map the jacobian determinant is also the inverse —i.e.
. For the middle term in Eq. 32 we proceed by selecting the standard basis
which is an orthonormal basis with respect to the Lorentz inner product. The directional derivative with respect to a basis element
is computed as follows:
As is a binary mask, it is easy to see that if
then only the first term on the RHS remains and the directional derivative with respect to
is simply the basis vector itself. Conversely, if
then the first term goes to zero and we are left with the second term,
Where in the second line we’ve used the fact . All together, the directional derivatives computed using our chosen basis elements are,
The volume factor given by this linear map is det, where G is the matrix of all directional derivatives. As the basis elements are orthogonal all non-diagonal entries of
go to zero and the determinant is the product of the Lorentz norms of each component. As
the overall determinant is then d
. Finally, the full log jacobian determinant of a T C layer is given by,
Thus the overall computational cost is only slightly larger than the regular Euclidean RealNVP, O(n).
We consider the following function , which we use to define a normalizing flow in n-dimensional hyperbolic space (represented via the Lorentz model):
where is the projection of
to the tangent space at the origin, i.e,
utilize a binary mask b so that
where 0 < d < n. In Equation equation 34 the function is an arbitrary function on the tangent space at the origin and
denotes the logistic function. The function
is a map from the tangent space at the origin to a subset of hyperbolic space defined by the set of points satisfying the condition that
(under their representation in the Lorentz model).
To do so, we will use the following facts without proof or justification:
• Fact 1: The chain rule for determinants, i.e., the fact that
Fact 2 and Fact 4 are proven in Nagano et al. (2019) “A Wrapped Normal Distribution on Hyperbolic Space for GradientBased Learning” for and rederived for general K in Skopek et al. (2019). Fact 3 follows from the fact that the determinant of the inverse of a function is the inverse of that function’s determinant. We will use similar arguments to obtain our determinant as were used in Nagano et al. (2019), and we refer the reader to Appendix A.3 in their work for background.
Our main claim is as follows
Proposition 3. The Jacobian determinant of the function in equation 13 is:
the argument to the parallel transport q is,
by Fact 3. Thus, we are left with the term
To evaluate this term, we rely on the following Lemma:
Now, define a function which acts on the subspace of
corresponding to the standard basis elements
where denotes the portion of the vector
corresponding to the standard basis elements
constants (which depend on
). In equation 45, we use
to denote the vector corresponding to only the dimensions d + 1, ..., n and similarily for
. Then we have that
Proof. First note that by design we have that
i.e., the output of is equal to right hand side of Equation equation 44 after prepending/concatenating 0s to the output of
by examining the directional derivative with respect to a set of basis elements of . Now, given that this is the tangent space at the origin, we know that the standard (i.e., Euclidean) basis elements
form a valid basis for this subspace, since they are orthogonal under the Lorentz normal and orthogonal to the origin itself. Now, we can note first that
In other words, the directional derivative for the first d basis elements is the simply the basis elements themselves. This can be verified by taking the definition of the directional derivative:
and noting that the
term must equal zero since by design. Now, for the basis elements
we have that
This holds because
since by design and because
due to the term inside the parallel transport and by our design of the function t. Together, these facts give that the Jacobian matrix for h (under the basis
) has the following block form:
This can again be done by the chain rule, where we use Facts 2-4 to compute the determinant for exponential map, logarithmic map, and parallel transport. Finally, the Jacobian determinant for the term
can easily be computed as since the standard Euclidean basis is a basis for the tangent space at the origin as shown in Appendix B.2.
In this section we provide more details regarding model architectures and hyperparameters for each experiment in 4. For all hyperbolic models we used a curvature warmup for 10 epochs which aids in numerical stability Skopek et al. (2019). Specifically, we set R = 11 and linearly decrease to R = 2 every epoch after which it is treated as a learnable parameter.
Structured Density Estimation. For all VAE models our encoder consists of three linear layers. The first layer maps the input to a hidden space and the other two layers are used to paramaterize the mean and variance of the prior distribution and map samples to the latent space. The decoder for these models is simply a small MLP that consists of two linear layers that map the latent space to the hidden space and then finally back to the observation space. One important distinction between Euclidean models and hyperbolic models is that we use aFor BDP the hidden dim size is 200 while for MNIST we use 600 and the latent space is varied as shown in Tables 1 and 2. All flow models used in this setting consist of 2 linear layers each of size 128. Between each layer in either the encoder and decoder we use the LeakyRelu (Xu et al., 2015) activation function while tanh is used between flow layers. Lastly, we train all models for 80 epochs with the Adam optimizer with default setting (Kingma & Ba, 2014).
Graph Reconstruction. For graph reconstruction task we use the VGAE model as a base (Kipf & Welling, 2016) which also uses three linear layers of size 16 as the encoder in the VAE model. The decoder however is parameter less and is simply an inner product either in Euclidean space or in for Hyperbolic models. As the reconstruction objective contains
terms we rescale the
penalty by a factor of 1/N such that each of the losses are on the same scale. This can be understood as a
VAE like model where
. Like the structured density estimation setting all our flow models consist of two linear layers of size 128 with a tanh nonlinearity. Finally, we train the each model for 3000 epochs using the Adam optimizer (Kingma & Ba, 2014).
Graph Generation. For the graph generation task we adapt the training setup from (Liu et al., 2019a) in that we pretrain a graph autoencoder for 100 epochs to generate node latents. Empirically, we found that using a VGAE model for hyperbolic space worked better than a vanilla a GAE model. Furthermore, instead of using simple linear layers for the encoder we use GAT (Veliˇckovi´c et al., 2017) layer of size 32, which has access to the adjacency matrix. We use LeakyReLU for our encoder non-linearity while tanh is used for all flow models. Unlike GRevNets that use node features sampled from N(0, I) we find that it is necessary to provide the actual adjacency matrix otherwise training did not succeed. Our decoder defines edge probabilities as are learned edge specific bias and temperature parameters implemented as one GAT layer followed by a linear layer both of size 32. Thus both the encoder and decoder are both parameterized and optimized using the Adam optimizer (Kingma & Ba, 2014).
We now provide additional qualitative results for density estimation in hyperbolic space as visualized in the Poincar´e disk. For these visualizations we take a density initially defined on Euclidean space and project them to the hyperboloid using the logarithmic map at the origin. We then sample 500 points from this new density and fit both T C and WHC based flows. The results for the learned densities are shown below in Figure 6.
Figure 6. Top: Wrapped Gaussian with Checkerboard pattern projected to hyperbolic space. Bot: 2D Spiral projected to hyperbolic space
Upon inspecting the code and data kindly provided by Mathieu et al. (2019) we uncovered some issues that led to us omitting their CS-PhD and Phylogenetics datasets in our comparisons. In particular, Mathieu et al. (2019) use a decoder in their cross-entropy loss that does not define a proper probability. This appears to have caused optimization issues that artificially deflated the reported performance of all the models investigated in that work. When substituting in the dot product decoder employed in this work, the accuracy of all models increases dramatically. After this change, there is no longer any benefit from employing hyperbolic spaces on these datasets. In particular, after applying this fix, the performance of the hyperbolic VAE used by Mathieu et al. (2019) falls substantially below a Euclidean VAE. Since we expect our hyperbolic flows to only give gains in cases where hyperbolic spaces provide a benefit over Euclidean spaces, these datasets do not provide a meaningful testbed for our proposed approach. Lastly, upon inspecting the code and data in Mathieu et al. (2019), we also found that the columns 1 and 2 in Table 4 of their paper appear to be swapped compared to the results generated by their code.