b

DiscoverSearch
About
My stuff
Normalizing Flows on Tori and Spheres
2020·arXiv
Abstract
Abstract

Normalizing flows are a powerful tool for building expressive distributions in high dimensions. So far, most of the literature has concentrated on learning flows on Euclidean spaces. Some problems however, such as those involving angles, are defined on spaces with more complex geometries, such as tori or spheres. In this paper, we propose and compare expressive and numerically stable flows on such spaces. Our flows are built recursively on the dimension of the space, starting from flows on circles, closed intervals or spheres.

Normalizing flows are a flexible way of defining complex distributions on high-dimensional data. A normalizing flow maps samples from a base distribution  π(u)to samples from a target distribution p(x) via a transformation f as follows:

image

The transformation f is restricted to be a diffeomorphism: it must be invertible and both f and its inverse  f −1must be differentiable. This restriction allows us to calculate the target density p(x) via a change of variables:

image

In practice,  π(u)is often taken to be a simple density that can be easily evaluated and sampled from, and either f or its inverse  f −1are implemented via neural networks such that the Jacobian determinant is efficient to compute.

A normalizing flow implements two operations: sampling via Equation (1), and evaluating the density via Equation (2). These operations have distinct computational requirements: generating samples and evaluating their density requires only f and its Jacobian determinant, whereas evaluating the density of arbitrary datapoints requires only  f −1 and itsJacobian determinant. Thus, the intended usage of the flow dictates whether  f, f −1or both must have efficient implementations. For an overview of various implementations and associated trade-offs, see (Papamakarios et al., 2019).

In most existing implementations of normalizing flows, both u and x are defined to live in the Euclidean space  RD, where D is determined by the data dimensionality. However, this Euclidean formulation is not always suitable, as some datasets are defined on spaces with non-Euclidean geometry. For example, if x represents an angle, its ‘natural habitat’ is the 1-dimensional circle; if x represents the location of a particle in a box with periodic boundary conditions, x is naturally defined on the 3-dimensional torus.

The need for probabilistic modelling of non-Euclidean data often arises in applications where the data is a set of angles, axes or directions (Mardia & Jupp, 2009). Such applications include protein-structure prediction in molecular biology (Hamelryck et al., 2006; Mardia et al., 2007; Boomsma et al., 2008; Shapovalov & Dunbrack Jr, 2011), rock-formation analysis in geology (Peel et al., 2001), and path navigation and motion estimation in robotics (Feiten et al., 2013; Senanayake & Ramos, 2018). Non-Euclidean spaces have also been explored in machine learning, and specifically generative modelling, as latent spaces of variational autoencoders (Davidson et al., 2018; Falorsi et al., 2018; Wang & Wang, 2019; Mathieu et al., 2019; Wang et al., 2019).

A more general formulation of normalizing flows that is suitable for non-Euclidean data is to take  u ∈ Mand x ∈ N, where M and N are differentiable manifolds and f : M → Nis a diffeomorphism between them. A diffi-culty with this formulation is that M and N are diffeomorphic by definition, so they must have the same topological properties (Kobayashi & Nomizu, 1963). To circumvent this restriction, Gemici et al. (2016) first project  M to RD, applythe usual flows there, and then project  RD back to N. How-ever, a naive application of this approach can be problematic when M or N are not diffeomorphic to  RD; in this case, the projection maps will necessarily contain singularities (i.e. points with non-invertible Jacobian), which may result in numerical instabilities during training. Another approach, proposed by Falorsi et al. (2019) for the case where N is a Lie group, is to apply the usual Euclidean flows on the Lie algebra of N (i.e. the tangent space at the identity element), and then use the exponential map to project the Lie algebra onto N. Since these maps are not from N to itself, they are not straightforward to compose. Also, the exponential map is not bijective and is computationally expensive in general. We discuss these issues in more detail in Section 3.

Our main contribution is to propose a new set of expressive normalizing flows for the case where M and N are compact and connected differentiable manifolds. Specifically, we construct flows on the 1-dimensional circle  S1, the D-dimensional torus  TD, the D-dimensional sphere  SD,and arbitrary products of these spaces. The proposed flows can be made arbitrarily flexible, and avoid the numerical instabilities of previous approaches. Our flows are applicable when we already know the manifold structure of the data, such as when the data is a set of angles or latent variables with a prescribed topology. We empirically demonstrate the proposed flows on synthetic inference problems designed to test the ability to model sharp and multi-modal densities.

We will begin by constructing expressive and numerically stable flows on the circle  S1. Then, we will use these flows as building blocks to construct flows on the torus  TDand the sphere  SD. The torus  TDcan be written as a Cartesian product of  D copies of S1, which will allow us to build flows on  TD autoregressively. The sphere  SD can be written as a transformation of the cylinder  SD−1 × [−1, 1], which will allow us to build flows on  SDrecursively, using flows on S1 and [−1, 1]as building blocks. By combining these constructions, we can build a wide range of compact connected manifolds of interest to fundamental and applied sciences.

2.1. Flows on the Circle S1

Depending on what is most convenient, we will sometimes view S1as embedded in R2, that is �(x1, x2) ∈ R2; x21 + x22 = 1�, or parameterize it by a coordinate  θ ∈ [0, 2π], identifying  0 and 2πas the same point. In this section, we describe how to construct a diffeomorphism f that maps the circle to itself.

Since 0 and  2πare identified as the same point, the transformation  f : [0, 2π] → [0, 2π]must satisfy appropriate boundary conditions to be a valid diffeomorphism on the circle. The following conditions are sufficient:

image

image

The first two conditions ensure that  0 and 2πare mapped to the same point on the circle. The third condition ensures that the transformation is strictly monotonic, and thus invertible. Finally, the fourth condition ensures that the Jacobians agree at  0 and 2π, thus the probability density is continuous.

A restriction in the above conditions is that 0 and  2πare fixed points. Nonetheless, this restriction can be easily overcome by composing a transformation f satisfying these conditions with a phase translation  θ �→ (θ + φ) mod 2π, where  φcan be a learnable parameter. Such a phase translation is volume-preserving, so it does not incur a volume correction in the calculation of the probability density.

Given a collection  {fi}i=1,...,Kof transformations satisfying the above conditions, we can combine them into a more complex transformation f that also meets these conditions. One such mechanism for combining transformations is function composition  f = fK ◦ · · · ◦ f1, which can easily be seen to satisfy Equations (3) to (6). Alternatively, we can combine transformations using convex combinations, as any convex combination defined by

image

also satisfies Equations (3) to (6). By alternating between function compositions and convex combinations, we can construct expressive flows on  S1 from simple ones.

Next, we describe three circle diffeomorphisms that by construction satisfy the above conditions: M¨obius transformations, circular splines, and non-compact projections.

2.1.1. M ¨OBIUS TRANSFORMATION

M¨obius transformations have previously been used to define distributions on the circle and sphere (see e.g. Kato et al., 2008; Wang, 2013; Kato & McCullagh, 2015). We will first describe the M¨obius transformation in the general case of the sphere  SD, and then show how to adapt it, when D = 1, to create expressive flows on the circle.

image

with norm strictly less than 1. For any point  z in SD, draw a straight line between  z and ω(as shown on the left). This line intersects  SD atexactly two distinct points. One is z. Call the other one  gω(z).

Definition 1. We define the M¨obius transformation  hω(z)of z with centre  ω to be −gω(z).

An explicit formula for  hωis given by

image

When  ω = 0, the transformation  hωis just the identity. In general, the variables z and  ωin Equation (8) are meant to be (D + 1)-dimensional real vectors. When D = 1, the equation also reads correctly if z and  ωare taken to be complex numbers. In this case, Equation (8) and the M¨obius transformations of the complex plane are related as explained in formula (6) of Kato & McCullagh (2015).

The transformation  hωexpands the part of the sphere that is close to  ω, and contracts the rest. Hence, it can transform a uniform base distribution into a unimodal smooth distribution parameterized by  ω. One property of the M¨obius transformation is that it does not become more expressive by composing various  hω. This is because the set of transformations  Rhω, where Rcan be any matrix in SO(D + 1), forms a group under function composition (see Theorem 2 of Kato & McCullagh, 2015). Since composition of two such transformations remains a member of the group, their expressivity is not increased.

In case of the circle (where D = 1), we would like to get expressive flows using a convex combination as in Equa- tion (7) of M¨obius transformations. The transformation hωdefines a map from angles to angles that satisfies Equa- tions (5) and (6), but it does not satisfy Equations (3) and (4). This can be fixed by adding a rotation after  hω. Let  Rωbe a rotation in  R2with centre (0, 0) that maps  hω(1, 0)back to (1, 0). We define  fω(θ)to be the polar angle, in  [0, 2π), of  Rω ◦ hω(z), and extend  fωby continuity to the whole range  [0, 2π]by setting  fω(2π) = 2π. This function sat-isfies Equations (3) to (6). We can then easily combine various  fωvia convex combinations and obtain expressive distributions on  S1—see Figure 7 in the appendix for an illustration. Unlike a single M¨obius transform, a convex combination of two or more M¨obius transforms is not analytically invertible, but it can be numerically inverted with precision  ϵusing bisection search with  O�log 1ϵ�iterations.

2.1.2. CIRCULAR SPLINES (CS)

Spline flows is a methodology for creating arbitrarily flexi-ble flow transformations from R to itself, first proposed by uller et al. (2019) and further developed by Durkan et al. (2019a;b). Here, we will show how to adapt the rational-quadratic spline flows of Durkan et al. (2019b) to satisfy the sufficient boundary conditions of circle diffeomorphisms.

Rational-quadratic spline flows define the transformation f : R → Rpiecewise as a combination of K segments, with each segment being a simple rational-quadratic function. Specifically, the transformation is parameterized by a set of K + 1 knot points  {xk, yk}k=0,...,Kand a set of K slopes {sk}k=1,...,K, such that  xk−1 < xk, yk−1 < yk and sk > 0for all k = 1, . . . , K. Then, in each interval  [xk−1, xk], thetransformation f is defined to be a rational quadratic:

image

where the coefficients  {αki, βki}i=0,1,2are chosen so that f is strictly monotonically increasing and  f(xk−1) = yk−1,f(xk) = yk, and  ∇f(θ)|θ=xk = skfor all k = 1 . . . , K (see Durkan et al., 2019b, for more details).

We can easily restrict f to be a diffeomorphism from  [0, 2π]to itself, by setting  x0 = y0 = 0 and xK = yK = 2π. Thisconstruction satisfies the first three sufficient conditions in Equations (3) to (5), and can be used to define probability densities on the closed interval  [0, 2π]. In addition, by setting  s1 = sK, we satisfy the fourth condition in Equa- tion (6), and hence we obtain a valid circle diffeomorphism which we refer to as a circular spline (CS).

Circular splines can be made arbitrarily flexible by increasing the number of segments K. Therefore, unlike M¨obius transformations, it is not necessary to combine them via convex combinations to increase their expressivity. An advantage of circular splines is that they can be inverted exactly, by first locating the corresponding segment (which can be done in O(log K) iterations using binary search since the segments are sorted), and then inverting the corresponding rational quadratic (which can be done analytically by solving a quadratic equation).

2.1.3. NON-COMPACT PROJECTION (NCP)

As discussed in the introduction, Gemici et al. (2016) project the manifold to  RD, apply the usual flows there, and then project  RDback to the manifold. Naively applying this method can be numerically unstable. However, here we show that, with some care, the method can be specialized to  S1in a numerically stable manner. Since this method involves projecting  S1 to the non-compact space R, we refer to it as non-compact projection (NCP).

We will use the projection  x : (0, 2π) → Rdefined by x(θ) = tan� θ2 − π2�. This projection maps the circle minus the point  θ = 0bijectively onto R. Applying the affine transformation  g(x) = αx + βin the non-compact space, where  α > 0and  βare learnable parameters, defines a corresponding flow on the circle, given by

image

with gradient

image

Even though the expression for f is not defined at the endpoints  0 and 2π, the expression for the gradient is. The transformation satisfies the appropriate boundary conditions,

image

Therefore, we can extend f to  [0, 2π]by continuity such that f(0) = 0 and  f(2π) = 2π, which yields a valid circle diffeomorphism. Although not immediately obvious, NCP flows and M¨obius transformations are intimately related, as explained in Appendix H (see also Downs & Mardia, 2002, Section 2.1).

The above boundary conditions are satisfied when the transformation g is affine, but they are not generally satisfied when g is an arbitrary diffeomorphism. This limits the type of flow we can put on the non-compact space. Therefore, instead of making g more expressive, we choose to increase the expressivity of the NCP flow by combining multiple transformations f via convex combinations.

Similarly to M¨obius transformations, NCP flows form a group. That’s because affine maps with positive slope on R form a group, and if  g1and  g2are affine maps and  fk =x ◦ gk ◦ x−1, then  f1 ◦ f2 = x ◦ g1 ◦ g2 ◦ x−1. Hence, the composition of two NCPs with parameters  (αk, βk), k =1, 2 is another NCP with parameters  (α1α2, β1 +α1β2). Socomposing two NCP flows would produce a new NCP flow, leading to no increase in expressivity.

A potential issue with NCP is that, near the endpoints 0 and 2π, evaluating f using Equation (10) directly is numerically unstable. To circumvent this numerical difficulty, we can use equivalent linearized expressions when  θis near the endpoints. For example, for  θ close to 0we can approximate f(θ) ≈ θ/α, whereas for  θclose to  2πwe have  f(θ) ≈2π + (θ − 2π)/α.

2.2. Generalization to the Torus TD

Having defined flows on the circle  S1, we can easily construct autoregressive flows on the D-dimensional torus TD ∼= (S1)D. Any density  p(θ1, . . . , θD)on  TDcan be decomposed via the chain rule of probability as

image

where each conditional  p(θi | θ1, . . . , θi−1)is a density on S1. Each conditional density can be implemented via a flow  fψi : S1 → S1, whose parameters  ψiare a function of (θ1, . . . , θi−1). Thus the joint transformation  f : TD → TDgiven by  f(θ1, . . . , θD) = (fψ1(θ1), . . . , fψD(θD))is an autoregressive flow on the torus. In the terminology of Papa- makarios et al. (2019, Section 3.1), an autoregressive flow on a torus is simply an autoregressive flow whose transformers are circle diffeomorphisms, i.e. obey the boundary conditions in Equations (3) to (6). As with any autoregressive flow, the Jacobian of f is triangular, therefore the Jacobian determinant in the density calculation in Equation (2) can be computed efficiently as the product of the diagonal terms.

The parameters  ψiof the i-th transformer are a function of  (θ1, . . . , θi−1)known as the i-th conditioner. In order to guarantee that the conditioners are periodic functions of each  θi, we can make  ψibe a function of (cos θ1, sin θ1, . . . , cos θi−1, sin θi−1)instead. In our experiments, we implemented the conditioners using coupling layers (Dinh et al., 2017). Implementations based on masking (Kingma et al., 2016; Papamakarios et al., 2017) are also possible.

More generally, autoregressive flows can be applied in the same way on any manifold that can be written as a Cartesian product of circles and intervals, such as the 2-dimensional cylinder. Flows on intervals can be constructed e.g. using regular (non-circular) splines as described in Section 2.1.2. Thus, by taking  fψito be either a circle diffeomorphism or an interval diffeomorphism as required, we can handle arbitrary products of circles and intervals.

2.3. Generalization to the Sphere SD

The M¨obius transformation (Section 2.1.1) can in principle be used to define flows on the sphere  SDfor  D ≥ 2, but, as noted, its expressivity does not increase by composition. Increasing the expressivity via convex combinations in D = 1 was only possible because we expressed diffeomorphisms on  S1as strictly increasing functions on [0, 2π]. This construction however does not readily extend to  D ≥ 2. Instead, we propose two alternative flow constructions for  SD: a recursive construction that uses cylindrical coordinates, and a construction based on the exponential map. In what follows,  SDwill be embedded in  RD+1as �(x1, . . . , xD+1) ∈ RD+1; �i x2i = 1�.

2.3.1. RECURSIVE CONSTRUCTION

In Section 2.1, we described three methods for building univariate flows on  S1. We also described how to build univariate flows on the interval  [−1, 1]using splines (Section 2.1.2). Using circle and interval flows as building blocks, we will construct a multivariate flow on  SD for D ≥ 2by recursing over the dimension of the sphere.

Our construction works as follows. First, we transform the sphere  SD into the cylinder  SD−1 × [−1, 1]using the map

image

image

Figure 1. Illustration of the recursive flow on the sphere  S2.

Then a two-stage autoregressive flow is applied on the cylinder. The ‘height’  r ∈ [−1, 1]is transformed first by a univariate flow  g : [−1, 1] → [−1, 1], and the ‘base’ z is transformed second by a conditional flow  f : SD−1 → SD−1whose parameters are a function of g(r), that is

image

Finally, the cylinder is transformed back to the sphere by

image

The flow  h : SD → SD is defined to be the composition h = Tc→s ◦ Tc→c ◦ Ts→c. When stacking multiple h (each with its own parameters), all the internal terms in  Ts→c ◦ Tc→scancel out, so it is more efficient to just not compute them. Only the first  Tc→sand last  Ts→cneed to be computed. Figure 1 illustrates this procedure on  S2.

The above recursion continues until we reach  S1, on whichwe can use the flows we have already described. Unrolling the recursion up to  S1 is equivalent to first transforming  SDinto  S1 × [−1, 1]D−1, then applying an autoregressive flow on  S1 × [−1, 1]D−1 as described in Section 2.2, and finally transforming  S1 × [−1, 1]D−1back to  SD. In practice, we can use any ordering of the variables in the autoregressive flow and not just the one implied by the recursion, and we can compose multiple autoregressive flows before transforming back to  SD. Figure 6in the appendix illustrates the model with the recursion fully unrolled.

At this point, it is important to examine carefully the transformations  Ts→cand  Tc→s. Maps from  SD−1 × [−1, 1]to  SDcan never be diffeomorphisms since the topology of these two spaces differ. Nevertheless,  Ts→c and Tc→s are in-vertible almost everywhere, in the sense that we can remove sets of measure  0 from SD−1 × [−1, 1] and SD, and the restriction of these maps to these subsets are diffeomorphisms. However, extra care must be taken to check that updates to the density caused by  Ts→cand  Tc→sare still numerically stable, and that the density is well-behaved everywhere on SD. In Appendix A.1, we prove that the update to a base density  π(z, r) due to Tc→s is

image

The update due to  Ts→cis simply the inverse of the above. Additionally, in Appendix A.1 we prove that the combined density update due to  h = Tc→s ◦ Tc→c ◦ Ts→cdoes not have any singularity whenever g is such that  g′(−1) > 0and  g′(1) > 0. Since we implement g with spline flows (Section 2.1.2), this condition is satisfied by construction, so the flow density is guaranteed to be finite.

Since the density update due to  Ts→c and Tc→scan be done analytically and the rest of the transformation is autoregressive, the overall density update is efficient to compute. For D = 2, the volume correction in Equation (15) is equal to 1, and therefore  Tc→sand  Ts→cpreserve infinitesimal volume elements. Finally, we observe that the well-known von Mises–Fisher distribution on  S2(Fisher, 1953) can be obtained by transforming a uniform base density with Equa- tions (12) to (14), where f is the identity and g has a simple functional form (for details, see Jakob, 2012).

2.3.2. EXPONENTIAL-MAP FLOWS

Ideally, we would like to specify flows directly on the manifold and avoid mapping between non-diffeomorphic sets. This motivates us to explore exponential-map flows, a mechanism for building flows on spheres proposed by Sei (2013).

The main idea consists of two steps. First, we define a cost-convex scalar field  φ(x)on  SD—for the definition of ‘cost-convex’, see (Sei, 2013, Formula 1). Second, we construct a flow using the exponential map of the gradient ∇φ(x) ∈ TxSD, where TxSD is the tangent space of  SD atx. The exponential map  expx : TxM → Mon a Riemannian manifold M is defined as the terminal point  γ(1)of a geodesic  γ : [0, 1] → Mthat passes through  γ(0) = xwith velocity  ˙γ(0) = v (Kobayashi & Nomizu, 1963). For a general manifold it is hard to compute exponential maps, but for the sphere  SD the exponential map is straightforward:

image

where  ∥ · ∥is the Euclidean norm.

Parameterizing a general cost-convex scalar field  φ(x)remains non-trivial on  SD. To solve this, Sei (2013) proposes to construct  φ(x)via a convex combination of simple scalar functions  φi : [0, π] → Rthat satisfy  φ′i(0) = φ′i(π) = 0and  φ′′i > −1. Specifically,  φ(x)is given by

image

where  d(x, µi)is the geodesic distance between  x ∈ SDand  µi ∈ SD, αi ≥ 0and �i αi ≤ 1. The exponential map of the gradient field of  φ(x)is guaranteed to be a diffeomorphism from  SD to SD (Sei, 2013, Lemma 4).

We found that it is challenging to learn concentrated multi-modal densities on  SDusing the polynomial and highfrequency scalar fields proposed by Sei (2013). To address this limitation, we introduce a scalar field which is inspired

by radial flows (Tabak & Turner, 2013; Rezende & Mo- hamed, 2015) and is given by

image

where  αi ≥ 0, �i αi ≤ 1, µi ∈ SDand  βi > 0. It is straightforward to see that the functions

image

satisfy the required conditions, therefore the map

image

is a diffeomorphism from  SD to SD.

As shown in Appendix A, the density update due to f is

image

where J(x) is the Jacobian of f at x when f is seen as a map from  RD+1to  RD+1, and the columns of E(x) form an orthonormal basis of vectors tangent to the sphere at x. Unlike the recursive flow which admits an efficient density update, computing the above density update costs  O�D3�, so the exponential-map flow is only practical for small D.

The study of distributions on objects such as angles, axes and directions has a long history, and is known as directional statistics (Mardia & Jupp, 2009). According to the taxonomy of Navarro et al. (2017), directional statistics traditionally uses three approaches for defining distributions on tori and spheres: wrapping, projecting, and conditioning.

Wrapping is typically used to define distributions on  S1. Theidea is to start with a distribution p(x) on R, and wrap it around  S1by writing  x = θ + 2πkand marginalizing out k. A common example is the wrapped Gaussian (see e.g. Jona-Lasinio et al., 2012, Section 2.1). A disadvantage of wrapped distributions is that, unless p(x) has bounded support, they require an infinite sum over k, which is generally not analytically tractable.

Projecting is an alternative approach where the manifold is embedded into  RMwith M > D, a distribution p(x) is defined on  RM, and the probability mass is projected onto the manifold. For example, if the manifold is the sphere  SDembedded in  RD+1, the probability mass in  RD+1can be projected onto  SDby writing  x = ruxwhere  r = ∥x∥and uxis a unit vector in the direction of x, and then integrating out r. A common example is the projected Gaussian on S1(see e.g. Wang & Gelfand, 2013). A difficulty with this approach is that it requires computing an integral over r, which may not be generally tractable.

Conditioning also begins by embedding the manifold into RM and defining a distribution  p(x) on RM (which here can be unnormalized). Then, the distribution on the manifold M is obtained by conditioning p(x) on  x ∈ M, that is, restricting p(x) on M and re-normalizing. Common examples when  M = S1are the von Mises (Wang, 2013, Section 1.3.3) and generalized von Mises distributions (Gatto & Jammalamadaka, 2007), which are obtained by taking an isotropic or arbitrary Gaussian (respectively) in  R2and conditioning on  ∥x∥ = 1. For the case of  M = SD, a com-mon example is the von Mises–Fisher distribution (Fisher, 1953), which is obtained by conditioning an isotropic Gaussian in  RDon  ∥x∥ = 1, and reduces to the von Mises for D = 1. The von Mises–Fisher distribution can also be extended to the Stiefel manifold of sets of N orthonormal D-dimensional vectors (Downs, 1972), and it can be generalized to the more flexible Kent distribution, also known as Fisher–Bingham, on  SD (Kent, 1982). Finally, distributions on  TD can be obtained by defining  p(x) in R2D and condi-tioning on  x2i + x2D+i = 1for all i = 1, . . . , D. Examples include the multivariate von Mises (Mardia et al., 2008) and multivariate generalized von Mises distributions (Navarro et al., 2017), which correspond to p(x) being Gaussian with constrained or arbitrary covariance matrix respectively. A drawback of conditioning is that computing the normalizing constant of the resulting distribution requires integrating p(x) on M, which is not generally tractable.

In general, the above three strategies lead to distributions with tractable density evaluation and sampling algorithms only in special cases, which typically yield simple distributions with limited flexibility. One approach for increasing the flexibility of such simple distributions is via combining them into mixtures (see e.g. Peel et al., 2001; Mardia et al., 2007). Such mixtures could be used as base distributions for the flows on tori and spheres that we present in this work. However, unlike flows whose expressivity increases via composition, mixtures generally require a large number of components to represent sharp and complex distributions, and can be harder to fit in practice.

A general method for defining distributions on Lie groups, which are groups with manifold structure, was proposed by Falorsi et al. (2019). The tangent space of a D-dimensional Lie group at the identity element is isomorphic to  RDand is known as the Lie algebra of the group. The method is to define a distribution p(x) on the Lie algebra (e.g. using standard Euclidean flows), and then map the Lie algebra onto the group using the exponential map. If the Lie group is a compact connected manifold, the exponential map is surjective but not injective, and the method can be thought of as a generalization of wrapping. When the Lie group is U(1) ∼= S1, we recover wrapped distributions on  S1, since inthis case the exponential map is  x �→ (cos x, sin x). When the Lie group is  TD, the exponential map wraps  RD aroundthe torus along each dimension. This involves an infinite sum, but even if the sum is restricted to be finite (e.g. by bounding the support of p(x)), the number of terms to be summed scales exponentially with D. In addition to the above, Falorsi et al. (2019) provide concrete examples for SO(3) and SE(3). The main drawbacks of this method is that the exponential map cannot be easily composed with other maps (since it is not generally bijective), and can be expensive to compute in high-dimensions.

Finally, a general method for defining normalizing flows on a Riemannian manifold M was proposed by Gemici et al. (2016). This method relies on the existence of two maps: an embedding  g : M → RMand a coordinate chart  φ :M → RD, where  D ≤ M. The idea is to apply the usual Euclidean flows on  RD, and then transform the density onto the (embedded) manifold via the map  g ◦ φ−1. The density

update associated with this transformation is√det J⊤J, where J is the Jacobian of  g◦φ−1 (as shown in Appendix A). This method leads to composable transformations, but is better suited for manifolds that are homeomorphic to  RD(so that  φis well-defined). Compact manifolds such as tori and spheres are not homeomorphic to  RD, hence  φcannot be defined everywhere on M. In such cases, the transformed density may not be well-behaved, which can create numerical instabilities during training.

For M¨obius transforms, we reparameterized the centre  ω, where  ∥ω∥ < 1, in terms of an unconstrained parameter ω′ ∈ R2via  ω = 0.991+∥ω′∥ω′. The constant 0.99 is used to prevent  ∥ω∥from being too close to 1. For circular splines, we prevent slopes from becoming smaller than  10−3by adding  10−3to the output of a softplus of unconstrained parameters. For autoregressive flows, the parameters of each conditional transformation are computed by a ReLU-MLP with layer sizes  [Ni, 64, 64, No], where Niis the number of inputs and  Nothe number of parameters for each transform. All flow models use uniform base distributions.

We evaluate and compare our proposed flows based on their ability to learn sharp, correlated and multi-modal target densities. We used targets  p(x) = e−βu(x)/Zwith inverse temperature  βand normalizer Z. We varied  βto create targets with different degrees of concentration/sharpness. The models  qηwere trained by minimizing the KL divergence

image

For an additional evaluation of how well the flows match the target, we used  S = 20,000 samples xs from the trained

image

Figure 2. Comparing flows on  T2 using ESS, for the targets in Figure 3 and various inverse temperatures  β. A larger value of  βmakes the target densities more concentrated and therefore harder to learn. Numbers in square brackets indicate the number of components K used for each transform. All values are averages across 10 runs with different neural-network initialization. See also Fig- ure 8 (appendix) for KL values and further comparisons. Top row: Unimodal target. Middle row: Multi-modal target with 3 mixture components. Bottom row: Correlated target.

flows to estimate the normalizer via importance sampling:

image

where  ws = e−βu(xs)/qη(xs). The effective sample size, (Arnaud et al., 2001; Liu, 2008), can be estimated by

image

Higher ESS indicates that the flow matches the target better (when reliably estimated). We report ESS as a percentage of the actual sample size.

All models were optimized using Adam (Kingma & Ba, 2015) with learning rate  2 × 10−4, 20,000iterations, and batch size 256. The reported error bars are the standard deviation of the average, computed from independent replicas of each experiment with identical hyper-parameters, but with different initialization of the neural network weights. All shown model densities are kernel density estimates using 20,000 samples and Gaussian kernel with bandwidth 0.2.

Figures 2 and 3 show results for  T2. The target densities are: a unimodal von Mises, a multi-modal mixture of von

image

Figure 3. Learned densities on  T2 using NCP, M¨obius and CS flows. Densities shown on the torus are from NCP.

image

Table 1. Comparing baseline and proposed flows on  S2 using KLand ESS. The target density is the mixture of 4 modes shown in Figure 4. We compare recursive M¨obius-spline flow (MS), exponential-map polynomial flow (EMP) and exponential-map sum-of-radial flow (EMSRE). Brackets show error bars on the KL from 3 replicas of each experiment.  NTis the number of stacked transformations for each flow;  Kmis the number of centres used in M¨obius;  Ksis the number of segments in the spline flow; K is the number of radial components in the radial exponential-map flow. The polynomial scalar field is shown in Appendix E.

Mises, and a correlated von Mises (their precise definitions are in Table 2 of the appendix). The results demonstrate that we can reliably learn multi-modal and correlated densities with different degrees of concentration. Among the circle flows, NCP and M¨obius performed the best, whereas CS performed less well for high  β. An additional experiment is shown in Appendix I, where flows on  T6are used to learn the posterior density over joint angles of a robot arm.

Table 1 shows results for  S2, where we compare the recursive flow to the exponential-map flow with polynomial and radial scalar fields. The recursive flow uses M¨obius for the circle and splines for the interval. For the exponential-map flow, the polynomial scalar field was proposed by Sei (2013) and is shown in Appendix E, whereas the radial scalar field is the new one we propose in Equation (18). The target density is the mixture of 4 modes shown in Figure 4. We demonstrate substantial improvement relative to the polynomial scalar field, but we find that the exponential-map

image

Figure 4. Learned multi-modal density on  S2 using exponential-map flows, using the Mollweide projection for visualization. The model is a composition of 24 exponential-map transforms, using the radial scalar field with 1 component.

image

Figure 5. Learned multi-modal density on SU(2) ∼= S3 using therecursive flow. Each column shows an  S2 slice of the  S3 densityalong a fixed axis using the Mollweide projection. Top row: target density. Bottom row: learned density. We used a M¨obius transform with  Km = 32for the circle, and spline transforms with Ks = 64for the two intervals (ESS = 84%, KL = 0.14).

flow with either scalar field is not yet competitive with the recursive flow. Figure 5 shows an example of learning a density on SU(2) ∼= S3using the recursive flow. Finally, Appendix J shows an example of training a recursive flow (using splines for both the circle and the interval) on data sampled form a ‘map of the world’ density on  S2.

This work shows how to construct flexible normalizing flows on tori and spheres of any dimension in a numerically stable manner. Unlike many of the distributions traditionally used in directional statistics, the proposed flows can be made arbitrarily flexible, but have tractable and exact density evaluation and sampling algorithms. We conclude with a comparison of the proposed models, a discussion of their limitations, and some preliminary thoughts on how to extend flows to other manifolds of interest to fundamental physics.

5.1. Comparison, Scope and Limitations

Among the flows on the circle, M¨obius and NCP performed the best, with CS performing less well for highly concentrated target densities. However, increasing the expressivity of M¨obius and NCP required convex combinations, whereas CS can be made more expressive by adding more spline segments. As a result, CS is the cheapest to invert (it can be done analytically), whereas M¨obius and NCP (with more than one component) require a root-finding algorithm such as bisection search. Therefore, in practice it may be preferable to use CS if both density evaluation and sampling are required, and use M¨obius or NCP otherwise.

On  SD, the recursive flow performed better than the exponential-map flow. In addition, the recursive flow scales better to high dimensions since its density can be computed efficiently, whereas the density of the exponential-map flow has a computational cost of  O�D3�. The theoretical advantage of the exponential-map flow is that it is intrinsic to the sphere, but this advantage did not result in a practical benefit in our experiments.

5.2. Towards Normalizing Flows on SU(D) and U(D)

The unitary Lie groups U(1), SU(2) and SU(3) are of particular interest to fundamental physics, because the symmetry groups of particle interactions are constructed from them (Woit, 2017). We have shown that we can construct expressive flows on U(1) ∼= S1 and SU(2) ∼= S3.

Our recursive construction for  SD provides a starting point for flows on SU(D) for D ≥ 3 and U(D) for D ≥ 2, via re-cursively building SU(D) from SU(D − 1) and U(1)2D−1,and U(D) from SU(D) and extra angles. These decompositions provide coordinate systems on SU(D) and U(D) in terms of Euler angles as shown by Tilma & Sudarshan (2002; 2004). Working with explicit coordinate systems on SU(D) and U(D) is more involved than on spheres; figur-ing out the necessary details such as appropriate boundary conditions for each coordinate/angle is a direction we plan to explore in the future in order to build flows on a wider variety of compact connected manifolds.

We thank Heiko Strathmann and Alex Botev for discussions and feedback. PES is partially supported by the National Science Foundation under CAREER Award 1841699 and PES and GK are partially supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under grant Contract Number de-sc0011090. KC is supported by the National Science Foundation under the awards ACI-1450310, OAC-1836650, and OAC-1841471 and by the Moore-Sloan data science environment at NYU.

Arnaud, D., de Freitas, N., and Gordon, N. Sequential Monte Carlo methods in practice. Information Science and Statistics. Springer New York, 2001.

Boomsma, W., Mardia, K. V., Taylor, C. C., Ferkinghoff- Borg, J., Krogh, A., and Hamelryck, T. A generative, probabilistic model of local protein structure. Proceedings of the National Academy of Sciences, 105(26):8932– 8937, 2008.

Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tom- czak, J. M. Hyperspherical variational auto-encoders. Conference on Uncertainty in Artificial Intelligence,

2018.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density esti- mation using Real NVP. International Conference on Learning Representations, 2017.

Downs, T. D. Orientation statistics. Biometrika, 59(3): 665–676, 1972.

Downs, T. D. and Mardia, K. V. Circular regression. Biometrika, 89(3):683–698, 2002.

Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Cubic-spline flows. ICML workshop on Invertible Neural Networks and Normalizing Flows, 2019a.

Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Neural spline flows. Advances in Neural Information Processing Systems, 2019b.

Falorsi, L., de Haan, P., Davidson, T. R., De Cao, N., Weiler, M., Forr´e, P., and Cohen, T. S. Explorations in homeomorphic variational auto-encoding. ICML workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.

Falorsi, L., de Haan, P., Davidson, T. R., and Forr´e, P. Reparameterizing distributions on Lie groups. International Conference on Artificial Intelligence and Statistics, 2019.

Feiten, W., Lang, M., and Hirche, S. Rigid motion estima- tion using mixtures of projected Gaussians. International Conference on Information Fusion, 2013.

Fisher, R. Dispersion on a sphere. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 217(1130):295–305, 1953.

Gatto, R. and Jammalamadaka, S. R. The generalized von Mises distribution. Statistical Methodology, 4(3):341– 353, 2007.

Gemici, M. C., Rezende, D. J., and Mohamed, S. Normal- izing flows on Riemannian manifolds. arXiv preprint arXiv:1611.02304, 2016.

Hamelryck, T., Kent, J. T., and Krogh, A. Sampling realistic protein conformations using local structural bias. PLoS Computational Biology, 2(9), 2006.

Jakob, W. Numerically stable sampling of the von Mises Fisher distribution on  S2(and other tricks). Technical report, Interactive Geometry Lab, ETH Z¨urich, 2012.

Jona-Lasinio, G., Gelfand, A., and Jona-Lasinio, M. Spatial analysis of wave direction data using wrapped Gaussian processes. The Annals of Applied Statistics, 6(4):1478– 1498, 2012.

Kato, S. and McCullagh, P. M¨obius transformation

and a Cauchy family on the sphere. arXiv preprint arXiv:1510.07679, 2015.

Kato, S., Shimizu, K., and Shieh, G. S. A circular–circular regression model. Statistica Sinica, 18:633–645, 2008.

Kent, J. T. The Fisher–Bingham distribution on the sphere. Journal of the Royal Statistical Society. Series B (Methodological), 44(1):71–80, 1982.

Kingma, D. P. and Ba, J. Adam: A method for stochas- tic optimization. International Conference for Learning Representations, 2015.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. Advances in Neural Information Processing Systems, 2016.

Kobayashi, S. and Nomizu, K. Foundations of differential geometry, volume 1. Interscience Publishers, 1963.

Liu, J. S. Monte Carlo strategies in scientific computing. Springer Science & Business Media, 2008.

Mardia, K. V. and Jupp, P. E. Directional statistics. Wiley Series in Probability and Statistics. John Wiley & Sons, 2009.

Mardia, K. V., Taylor, C. C., and Subramaniam, G. K. Pro- tein bioinformatics and mixtures of bivariate von Mises distributions for angular data. Biometrics, 63(2):505–512, 2007.

Mardia, K. V., Hughes, G., Taylor, C. C., and Singh, H. A multivariate von Mises distribution with applications to bioinformatics. The Canadian Journal of Statistics, 36 (1):99–109, 2008.

Mathieu, E., Le Lan, C., Maddison, C. J., Tomioka, R., and Teh, Y. W. Continuous hierarchical representations with Poincar´e variational auto-encoders. Advances in Neural Information Processing Systems, 2019.

M¨uller, T., McWilliams, B., Rousselle, F., Gross, M., and Nov´ak, J. Neural importance sampling. ACM Transactions on Graphics, 38(5):1–19, 2019.

Navarro, A. K. W., Frellsen, J., and Turner, R. E. The multi- variate generalised von Mises distribution: Inference and applications. AAAI Conference on Artificial Intelligence, 2017.

Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. Advances in Neural Information Processing Systems, 2017.

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. arXiv preprint

arXiv:1912.02762, 2019.

Peel, D., Whiten, W. J., and McLachlan, G. J. Fitting mix- tures of Kent distributions to aid in joint set identification. Journal of the American Statistical Association, 96(453): 56–63, 2001.

Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. International Conference on Machine Learning, 2015.

Sei, T. A Jacobian inequality for gradient maps on the sphere and its application to directional statistics. Communications in Statistics—Theory and Methods, 42(14): 2525–2542, 2013.

Senanayake, R. and Ramos, F. Directional grid maps: mod- eling multimodal angular uncertainty in dynamic environments. IEEE/RSJ International Conference on Intelligent Robots and Systems, 2018.

Shapovalov, M. V. and Dunbrack Jr, R. L. A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions. Structure, 19(6):844–858, 2011.

Tabak, E. G. and Turner, C. V. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.

Tilma, T. and Sudarshan, E. Generalized Euler angle parametrization for SU (N). Journal of Physics A: Mathematical and General, 35(48):10467, 2002.

Tilma, T. and Sudarshan, E. Generalized Euler angle param- eterization for U (N) with applications to SU (N) coset volume measures. Journal of Geometry and Physics, 52 (3):263–283, 2004.

Wang, F. and Gelfand, A. E. Directional data analysis un- der the general projected normal distribution. Statistical Methodology, 10(1):113–127, 2013.

Wang, M. Extensions of probability distributions on torus, cylinder and disc. PhD thesis, Graduate School of Science and Technology, Keio University, 2013.

Wang, P. Z. and Wang, W. Y. Riemannian normalizing flow on variational Wasserstein autoencoder for text modeling. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.

Wang, Z., Cheng, S., Li, Y., Zhu, J., and Zhang, B. A Wasserstein minimum velocity approach to learning unnormalized models. Symposium on Advances in Approximate Bayesian Inference, 2019.

Woit, P. Quantum theory, groups and representations: An introduction. Springer, 2017.

In this section, we explain how to update the density of a distribution transformed from one Riemannian manifold to another by a smooth map. We only consider the case where both manifolds are sub-manifolds of Euclidean spaces.

Let M and N be D-dimensional manifolds embedded into Euclidean spaces  Rmand  Rnrespectively. For example, M and N could be  SDembedded in  RD+1as in Sec- tion 2.3. Both manifolds inherit a Riemannian metric from their embedding spaces. Let T be a smooth injective map T : M → N. We will assume that T can be extended to a smooth map between open neighbourhoods of the embedding spaces that contain M and N, and that we have chosen such an extension. For example, the exponential-map flow in Equation (19) can be written using the coordinates of the embedding space  RD+1, and can thus be extended to open neighbourhoods of the embedding spaces as desired.

In what follows, we will use the fact that if  u1, . . . , uDare vectors in  Rn, then the volume of the parallelepiped with sides  u1, . . . , uDis�det(U ⊤U), where U is the  n × Dmatrix with column vectors  u1, . . . , uD. If u1, . . . , uD forman orthonormal system, this volume is 1.

Let  π : M → R+be a density on M. This defines a distribution on M and we can use T to transform it into a distribution on N. Let  p : N → R+be the density of the transformed distribution. We are interested in computing p assuming we know  π. Let x be a point on M, and e1, . . . , eDbe an orthonormal basis of the tangent space TxM. Define E to be the  m × Dmatrix with i-th column vector  ei. Let J be the n×mJacobian of T, where T is seen as a map between open sets in  Rmand  Rn. The tangent map of T at x transforms each  eito  Jei, and the matrix that collects all transformed vectors in its columns is JE. Hence, the volume of the parallelepiped with sides the trans-

formed vectors is�det((JE)⊤JE) =�det(E⊤J⊤JE).Therefore, the density p is given by

image

In the special case where  M = N = RD and m = n = D,the matrix E is an orthogonal matrix, and the above reduces to the familiar density update in Equation (2).

A.1. The case of  Tc→s : SD−1 × [−1, 1] → SD

In this section, we specialize to  M = SD−1 × (−1, 1) andN = SD with D ≥ 2. In particular, we will prove:

Proposition 1. Let  πbe a density  π : SD−1 × (−1, 1) →R+. Let  p : SD → R+be the density of the transformed distribution under  Tc→s. Then:

image

Proof. The sphere  SD−1 is embedded in  RD. This gives us an embedding of  M in RD+1. The map Tc→s, introduced in Equation (14), is easily extended to a map  RD × (−1, 1) →RD+1 using the same formula  Tc→s(z, r) = (√1 − r2z, r).Its Jacobian is an upper triangular D + 1 by D + 1 matrix

image

We will use a symmetry argument to simplify the computation of the determinant in Equation (24). Let G be a rotation of  RD+1 that leaves the last coordinate invariant.1 Note that for any point  x ∈ RD+1, we have Tc→s(Gx) = GTc→s(x).This means that the Jacobian J transforms as a function of x as  J(Gx) = GJ(x)G⊤. Note also that if x is in M, and E(x) is a matrix where the column vectors form a basis of the tangent space at x, then Gx is also in M, and GE(x) is a matrix where the column vectors form a basis of the tangent space at Gx. So we can choose E(Gx) = GE(x). With that choice

image

Since for any  x ∈ M, we can always choose G such that Gx is of the form  (√1 − r2, 0, . . . , 0, r), we can restrict ourselves to this case. For such a choice, the Jacobian simplifies to

image

For E, we can simply choose the D + 1 by D matrix made by removing the first column from the identity matrix. Then JE is equal to J with the first column removed:

image

The product  (JE)⊤JEis simply a diagonal matrix of size D with diagonal  (1 − r2, . . . , 1 − r2, 11−r2 ). Taking the determinant and applying Equation (24) concludes the proof.

image

At first, Proposition 1 might seem worrying since the density ratio in that proposition vanishes when  r is −1 or 1. So, asr approaches the boundary of the interval  [−1, 1], it seems that the correction term to the density will tend to infinity and lead to numerical instability.

What saves us is that we do not use  Tc→son its own, and instead combine it with a particular flow transformation on SD−1 × [−1, 1]and the inverse  Ts→c, as shown in Equa- tions (12) to (14). In these formulas, the map g is a spline on the interval  [−1, 1]which maps  −1to  −1, 1to 1, and has strictly positive slopes  g′(−1) and g′(1). Looking only at  −1 (the case 1can be similarly dealt with), this means

image

As  ϵgoes to 0, the density corrections coming from  Tc→sand  Ts→ccombine to

image

as  ϵgoes to 0. In particular, the terms that tend to infinity cancel each other, and the flow is well-behaved. When implementing the flow, numerical stability is achieved by not adding the terms that cancel each other. Finally, we note a subtle point about what we proved: the sequence of transformations  Tc→s ◦ Tc→c ◦ Ts→cwill transform a distribution with finite density into another distribution with finite density, but we do not guarantee that the resulting density will be continuous.

In Figure 6, we provide an illustration of the recursive construction in Equations (12) to (14), showing the specific wiring order of the conditional maps inside the flow. This order is the one implied by the recursion. In general, any other order can be used, or a composition of autoregressive flows with multiple orders.

To illustrate the kind of densities we get on  S1using the M¨obius flows, we show a few random examples in Figure 7.

Another family of circle transformations that we considered are Fourier transformations, defined by

image

image

Figure 6. Detailed illustration of the recursive flow on the sphere SD showing the explicit wiring of the conditional flows. The sphere  SD is recursively transformed to the cylinder  S1 ×[−1, 1]D−1, then an autoregressive flow is applied to the cylinder, and finally the cylinder is transformed back to the sphere.

where  µ = �i αi sin(φi), wi ∈ Z, φi ∈ [0, 2π]and �i |αi| ≤ 1. The integers  wiare fixed frequencies in the Fourier basis.

We found empirically that this family of transformations is not competitive with the other transformations considered in this paper, especially for highly concentrated densities as shown in Figure 8.

The polynomial exponential map of Sei (2013) is the exponential-map flow built using the scalar field

image

where  x ∈ SDin the coordinates of the embedding space RD+1. The parameters  µ and Amust satisfy the constraint ∥µ∥1 + ∥A∥1 ≤ 1 where ∥·∥1is the elementwise  ℓ1 norm.

For the experiments on the torus  T2, we used targets built from densities in the von Mises family as shown in Table 2.

On the sphere  S2, the target was a mixture of the form

image

where  µ1 = (0.7, 1.5), µ2 = (−1, 1), µ3 = (0.6, 0.5), µ4 = (−0.7, 4), Ts→emaps from spherical to Euclidean

image

Figure 7. Probability density functions of convex combinations of  15 M¨obius transformations applied to a uniform base distribution on the circle  S1. Each of these distributions required  30 = 15 × 2parameters.

image

Table 2. Target densities used for experiments on  T2.

image

Figure 8. Same as Figure 2, with KL and values for the Fourier transforms added. For the Fourier models, the numbers between brackets represent used frequencies, and a number before the bracket means each frequency was repeated. For example, Fourier3[1 − 4] is aFourier model with 12 frequencies: 3 frequencies of k for each k = 1, . . . , 4.

coordinates, and  x ∈ R3 is a point on the embedded sphere in Euclidean coordinates.

On SU(2) ∼= S3, the target was a mixture of the same form where  µ1 = (1.7, −1.5, 2.3), µ2 = (−3.0, 1.0, 3.0), µ3 = (0.6, −2.6, 4.5), µ4 = (−2.5, 3.0, 5.0), and  x ∈ R4is a point on the embedded sphere in Euclidean coordinates.

The recursive formulas shown in Equations (12) to (14) require choosing a sequence of axes in order to construct the cylindrical coordinate system. This may introduce artifacts to the density related to this choice of axes. To test if this results in numerical problems, we compare the flow from Equations (12) to (14) on a target density that forms a non-axis-aligned ring against a composition of the same flow with a learned rotation.

The results of this experiment are shown in Figure 9. We compared both large (Ks = 32, Km = 12) and small (Ks = 3, Km = 3) versions of the auto-regressive M¨obiusSpline flow and observed no significant differences between the two models on  S2.

More experiments would be necessary to investigate this potential effect in higher dimensions.

For a general M¨obius transformation

image

where  a, b, c, d, z ∈ C, to define a diffeomorphism on  S1 itmust be constrained to be of the form

image

This form ensures that  h(z)¯h(z) = 1 if z¯z = 1and has two real-valued free parameters  ℜ(a) and ℑ(a).

In what follows we show that for the choice  ℑ(a) = 0and  ℜ(a) = − 1−α1+α, the transformation h(z) is equivalent to an NCP transform  w = 2 arctan�α tan� θ2�+ β�with scale parameter  αand offset parameter  β = 0(assuming w, θ ∈ (−π, π)). If we define  θvia  z = eiθ, the goal is to show that w defined via

image

follows the NCP transformation rule

image

We begin by expanding Equation (31) in terms of more basic trigonometric quantities,

image

In order to isolate w, only the numerator v = 2�cos� θ2�+ iα sin� θ2��2of the expression above matters as we are only interested in ratios of the imaginary and real parts of this expression,  tan(w) = ℑ(v)ℜ(v). The numerator can be expanded as

image

from which we conclude,

image

Using the trigonometric formula  tan(2x) = 2 tan(x)1−tan(x)2, we arrive at the final result

image

As a concrete application of flows on tori, we consider the problem of approximating the posterior density over joint angles  θ1,...,6of a 6-link 2D robot arm, given (soft) constraints on the position of the tip of the arm. The possible configurations of this arm are points in  T6. The position  rkof a joint k = 1, . . . , 6 of the robot arm is given by

image

where  r0 = (0, 0)is the position where the arm is affixed, lk = 0.2is the length of the k-th link, and  θkis the angle of the k-th link in a local reference frame. The constraint on the position of the tip of the arm,  r6, is expressed in the form of a Gaussian-mixture likelihood  p(r6 | θ1,...,6)with two components. The prior  p(θ1,...,6)is taken to be a uniform distribution on  T6. The experimental results are illustrated in Figure 10.

In most of the experiments shown on this paper, we trained the models to fit a target density known up to a normalization constant (i.e. an inference problem). In this experiment we train our flow directly on data samples instead.

Training a flow-based model from data samples via maximum likelihood requires an explicit computation of the inverse map as shown in Equation (2). To demonstrate this is feasible with data coming from a non-trivial target density on the sphere  S2(i.e. that would require a large number of mixture components from simpler densities such as von Mises), we created a dataset of samples on the sphere coming from a density shaped as Earth’s continental map as shown in Figure 11 (left).

We trained a flow built from stacking two autoregressive flows. Each flow in the stack used circular splines and standard splines on the interval. The model was trained to maximize the likelihood of the dataset for 100,000 training steps. Both splines used  Ks = 80segments. The neural networks producing the spline parameters are the same as for the other experiments. In Figure 11 (middle) we show samples from the learned model overlaid on Earth’s map and in Figure 11 (right) we show a heat map of the learned density.

image

Figure 9. Learning a non-axis-aligned density on  S2 usingEquations (12) to (14) with and without composing with a learnable rotation. We compare M¨obius-spline flow (MS) (Ks = 32, Km = 12), learnable rotation composed with MS (Rot  ◦MS), small MS (Ks = 3,Km = 3) (SMS) and learnable rotation composed with SMS (Rot  ◦SMS). We observed no substantial differences between these models, suggesting that the particular choice of axis inside the flow has no impact on performance.

image

Figure 10. Learning the posterior density over joint angles of a 6-link 2D robot arm. We used an autoregressive M¨obius flow on the torus T6 to approximate the posterior density of joint angles resulting in a Gaussian mixture density for the tip of the robot arm. Left: The heat map shows the target density for the tip of the robot arm, a Gaussian mixture with two modes with centres at  (−0.5, 0.5) and (0.6, −0.1).White paths show arm configurations sampled from the learned model in angle space converted to position space. Right: Density of the tips of the robot arm using samples from the learned model.

image

Figure 11. Learning a complex density from data samples using autoregressive spline flows. Left: Target density from which i.i.d. data samples were generated. Middle: Model samples overlaid on target density; Right: Heat map of the learned density.


Designed for Accessibility and to further Open Science