The orientation of our body and shoulder-line changes continuously as we walk. When our gait is regular, these changes are nearly periodic and follow the swinging trend of our trajectories as we balance our weight between our feet [1]. At times, motion direction and body orientation remain temporarily decoupled. This happens, for instance, when we sidestep or in proximity of turns and distractions.
Shoulder-line yawing is not just a mechanical reflection of the walking action; rather, it becomes an essential dynamic ingredient as our motion gets geometrically constrained, e.g. by a dense crowd or by a narrow environment. In both cases, as we make our way to our destination, we rotate our bodies sideways, consciously or unconsciously, to minimize collisions or to maintain comfort distances from other pedestrians or the environment.
The dynamics of shoulder-line rotation has scarcely been investigated from a quantitative viewpoint. The data currently available are extremely limited and have been acquired via a few laboratory experiments (e.g. [6, 7]). Such scarcity of accurate data hinders statistical characterizations of the rotation dynamics beyond the estimation of average properties, to include, e.g., fluctuations and rare events. We believe that this is connected with the inherent technical complexity of measuring body yawing accurately and in real-life conditions. Real-life measurement campaigns, in fact, need to rely only on non-intrusive imaging data (or alike) of pedestrians, and cannot be supported by ad hoc wearable sensors, such as accelerometers [7]. Indeed, even the accurate estimation of the position of an individual in real life, a more “macroscopic” or “coarser-scale” degree of freedom than orientation, is a recognized technical challenge [8]. In recent years, overhead depth-sensing [9, 10, 4, 5], as used in this work, has been successfully employed to perform accurate pedestrian localization and prolonged tracking campaigns (see example in Fig. 1 and [3]). Overhead depth data not only allow privacy-respectful data acquisition, but also enable accurate position measurements even in high-density conditions (for a highly accurate algorithm leveraging machine learning-based analyses see, e.g., [11]).
In this paper we propose a novel method to measure, in real-life conditions and with very high accuracy, the shoulder rotation of walking pedestrians. Our measurement method is based on a deep Convolutional Neural Network [12] (CNN) point-estimator which operates on overhead depth images centered on individual pedestrians, from now on referred to as “imagelets”. Intuition suggests that pedestrians seen from an overhead perspective have a well-defined elongated, ellipse-like shape. In our measurements this is true only in a small fraction of cases, in which pedestrians walk carrying their arms alongside the body. Conversely, we found a majority of exceptions, impossible to address with hand-crafted algorithms (cf. Fig. 2). This marks an ideal use-case for supervised deep learning [12].
Figure 1: We measure and investigate the dynamics of shoulder orientation for walking pedestrians in real-life scenarios. Our measurements are based on raw data acquired via grids of overhead depth sensors, such as Microsoft Kinect [2]. In (a,b) we report, respectively, a front and an aerial view of a data acquisition setup (similar to that in [3]). The sensors, whose typical view cone is reported in (a), are represented in (b) as thick segments. In overhead depth images (c), the pixel value, here rendered in gray scale, represents the distance between each pixel and the camera plane: brighter shades are far from the sensor and, the darker the pixel color, the closer the pixel is to the sensor. Heads are, therefore, in darker shade than the floor. Through localization and tracking algorithms from [4, 5], we extract imagelets centered on individual pedestrians (cf. imagelets annotated with ground truth in Fig. 2) for which we estimate orientations via the method introduced here.
Figure 2: (a,c) Pedestrian trajectories (purple) superimposed on depth snapshots (gray). Orientation estimates and local velocities (directions of motion) are reported, respectively, in red and yellow. We estimate shoulder orientation on a snapshot-by-snapshot basis, considering depth “imagelets” centered on a pedestrian. The sub-panel (b) reports an example of such an imagelet with the coordinate system considered. We employ the instantaneous direction of motion, $\theta_v$, extracted from preexisting trajectory data, as training labels for a neural network. This yields a reliable estimator for the orientation, $\theta$, accurate even in cases challenging for humans, like in (c). Due to clothing, arms and body posture, the presence of backpacks or errors in depth reconstruction, the overhead pedestrian shape might appear substantially different from an ellipse elongated in the direction of the shoulders.
It is well known that the high performance of Deep Neural Network methods comes at the price of often prohibitively labor-intensive manual annotation of training data (frequently in the order of millions of individual images). Depending on the context, the reliability of human annotations can furthermore be questionable; this is the case whenever different experts are in frequent mutual disagreement about the annotation value. Shoulder orientation in depth imagelets falls in such a case. Here, by relying on the strong statistical correlation between individual velocity and body orientation, we manage to produce potentially limitless annotations. While walking on straight paths, our velocity direction is, on average and to a very good approximation, orthogonal to our shoulder line. On this basis, we can employ the velocity direction as an annotation for the orientation that is slightly imperfect on each single sample, but correct on average. Notably, the zero-average residual error between the velocity direction and the actual orientation gets averaged out as we train our CNN point-estimator with gradient descent. In this way, the training itself compensates for annotation errors.
Figure 3: Examples of synthetic imagelets that we employ to analyze the performance of our neural network. In contrast to the real-life data, ground-truth orientation is available for synthetic data, enabling accurate validation of the estimations. The neural network is trained against labeled target data with a predefined noise level (standard deviation of 20°) to imitate training with real-life imagelets and velocity target data. Target data with predefined noise level and ground-truth orientation for validation are superimposed on the imagelets as blue and red bars, respectively.
We investigate the orientation measurement accuracy of our method and consider its error scaling vs. the size of the training set using both real-life and synthetic depth imagelets. Combining extensive training with the enforcement of O(2) symmetry of the estimator, we show that we can deliver an orientation estimator with an error as low as 7.5 degrees. Our tool enables us to characterize the stochastic process that connects the instantaneous velocity direction to the shoulder orientation. We show that the velocity orientation can be well described by delaying the orientation dynamics through a stochastic process centered at about 100 ms and with Ornstein-Uhlenbeck (OU) statistics.
Conceptually speaking, although our tool has been devised for depth imagelets, it can be easily extended to other computer vision-based pedestrian tracking approaches and, more in general, can be used for any system in which there is a statistical connection between (average) individual “particle” velocity and (average) shape.
Let I be an overhead imagelet centered on a pedestrian, see examples in Fig. 2 (for convenience we opt for imagelets of square shape, yet this is not a constraint).
We define the shoulder-line orientation angle, $\theta$, as the angle between the direction normal to the shoulder-line and a fixed reference, here the y axis (direction $e_y$, cf. Fig. 2(a,b)). According to this definition, a body rotation of 180° leaves $\theta$ unchanged. Thus, we aim at a function f such that
$$ f : I \mapsto f(I) = \hat\theta \approx \theta, \tag{1} $$
where $f(I) = \hat\theta$ approximates the actual orientation $\theta$ (with $\theta \in [-\pi/2, \pi/2)$, i.e. $\theta$ is an element of the real projective line $P^1(\mathbb{R})$, see e.g. [13]).
We model the mapping f via a deep neural network that we train in a supervised, end-to-end fashion (see structure in the Supporting Information, SI). The network returns the estimate of $\theta$ as a discrete probability distribution, $p = p(I)$, on $P^1(\mathbb{R})$ (quantized in B = 45 uniform bins, 4° wide, via a soft-max activation function in the final layer). We retain the $P^1(\mathbb{R})$-average (“circular average”) of the distribution $p$ as final output. In formulas,
$$ f(I) = \hat\theta = \mathbb{E}_{P^1(\mathbb{R})}\left[ p(I) \right]; \tag{2} $$
we leave the details to the SI. We train with orientation data with a “two-hot” encoding: each orientation is unambiguously represented in terms of a probability distribution non-vanishing on (up to) two adjacent angular bins (we chose “two-hot” in opposition to the typical one-hot training data for classification problems, in which the annotations are Dirac probability distributions on the ground-truth class). We will refer to this encoding, which avoids quantization errors, as $\delta^B_\theta$ (we observed no strong sensitivity to the number of bins when these were 10 or more). As usual, we use a cross-entropy loss,
$$ \mathcal{L}\left(p, \delta^B_\theta\right) = -\sum_{i=1}^{B} \left(\delta^B_\theta\right)_i \log p_i . $$
We employ pedestrian velocity information to tackle the need for huge amounts of accurately annotated data to train the free parameters of the deep neural network (usually in the millions, as in our case). Let $\theta_v(t)$ be the angle between the walking velocity and a reference at time t > 0, i.e.
$$ \theta_v(t) = \angle\left(v(t), e_y\right), $$
where $v(t)$ is the instantaneous velocity, and $\angle(\cdot,\cdot)$ denotes the angle comprised between the directions in its argument (with $\pi$-periodicity). Our shoulder line is most frequently, and to a very good approximation, orthogonal to the walking velocity, i.e.
$$ \theta(t) \approx \theta_v(t). \tag{4} $$
Therefore, velocities provide a meaningful “proxy” annotation for orientation. We used the “approximately equal” sign in (4) because we can have frequent, yet small, disagreements between velocity and orientation. These can be due to a small loss of alignment between the two (e.g. because something attracted our attention) or they can be due to inaccuracies, e.g., in the velocity measurements. It is also possible, yet less likely, that velocity and orientation remain misaligned for longer time intervals. This holds, e.g., for people walking sideways. We regard these as rare occasions, which we expect to occur symmetrically for both left and right sides, with no relevant weight in our training dataset. This hypothesis reasonably holds for unidirectional pedestrian flows in rectilinear corridors, but might be invalid in case, e.g., of curved paths. Formally, for a walking person, we model the relation in (4) as
$$ \theta_v(t) = \theta(t) + \epsilon(t), \tag{5} $$
with $\epsilon$ being a small, symmetric, and zero-centered residual.
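The role of the zero-centered residual can be illustrated numerically. The following is a minimal sketch, not the training procedure itself: it assumes a Gaussian residual with a 20° standard deviation (an illustrative value) and uses the doubled-angle average for π-periodic angles; the helper names `wrap`, `projective_mean` and `noisy_proxy_labels` are hypothetical.

```python
import math
import random

def wrap(theta):
    """Map an angle (radians) to [-pi/2, pi/2), i.e. modulo pi."""
    return (theta + math.pi / 2) % math.pi - math.pi / 2

def projective_mean(angles):
    """Average of pi-periodic angles: average the unit vectors at 2*theta."""
    x = sum(math.cos(2 * a) for a in angles)
    y = sum(math.sin(2 * a) for a in angles)
    return wrap(0.5 * math.atan2(y, x))

def noisy_proxy_labels(theta, sigma, n, rng):
    """Proxy labels theta_v = theta + eps, with eps ~ N(0, sigma) (assumed Gaussian)."""
    return [wrap(theta + rng.gauss(0.0, sigma)) for _ in range(n)]

rng = random.Random(0)
theta_true = math.radians(30.0)
labels = noisy_proxy_labels(theta_true, math.radians(20.0), 20000, rng)
estimate = projective_mean(labels)

print("single-label error (deg):", math.degrees(abs(wrap(labels[0] - theta_true))))
print("averaged-label error (deg):", math.degrees(abs(wrap(estimate - theta_true))))
```

Each individual label can be off by tens of degrees, while the average over many labels sits close to the true orientation, which is the property the proxy annotation relies on.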
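We train our neural network using the labels $\theta_v(t)$ as a proxy for $\theta(t)$. The training process aims at the minimization of the (average) loss $\mathbb{E}\left[\mathcal{L}\left(p(I(t)), \theta_v(t)\right)\right]$, i.e. of the average cross-entropy between the network output $p$ and the (two-hot encoded) velocity labels. As such, the output $p$ converges to the distribution of annotations of similar imagelets, whose average is the correct point-estimation of the orientation:
$$ f(I) = \mathbb{E}_{P^1(\mathbb{R})}\left[p(I)\right] \approx \mathbb{E}\left[\theta_v \mid I\right] = \theta. \tag{6} $$
Moreover, we require the estimator to respect the symmetries of the problem, i.e.
$$ f(\Phi I) = \det(\Phi)\, f(I) + \alpha \tag{7} $$
for all transformations $\Phi = R_\alpha \Phi_0 \in O(2)$ that concatenate a rotation of angle $\alpha$, $R_\alpha$, and, possibly, a reflection (i.e. $\Phi_0 \in \{\mathrm{Id}, J\}$, respectively the identity and the reflection, from which the sign change given by the determinant of the transformation: $\det(\Phi) = \det(\Phi_0) = \pm 1$).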
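Symmetries in neural networks are often injected at training time, by augmenting the training set with all the symmetry group orbits. Similarly, we include multiple copies of the same imagelets with multiple random rotations, with and without flipping. This also ensures that the training set spans $P^1(\mathbb{R})$ uniformly. Yet, this does not yield a strictly O(2)-symmetric estimator (cf. (7)). We further enforce this symmetry by constructing a new map, $\tilde f$, as the O(2)-group average of f, which thus strictly respects (7). In formulas it holds
$$ \tilde f(I) = \mathbb{E}_{\Phi \in O(2)}\left[ \det(\Phi)\,\big(f(\Phi I) - \alpha_\Phi\big) \right] \tag{8} $$
$$ \phantom{\tilde f(I)} = \int_{O(2)} \det(\Phi)\,\big(f(\Phi I) - \alpha_\Phi\big)\, d\mu(\Phi), \tag{9} $$
where $\alpha_\Phi$ is the rotation angle of $\Phi$, $\mu$ is the normalized uniform measure on O(2) and the averages are understood in the $P^1(\mathbb{R})$ sense; we leave the proof of this identity, the O(2)-symmetry of $\tilde f$ and further details on $P^1(\mathbb{R})$-averages to the SI. In the following, we consider approximations of the integral in (9) by equi-spaced and random sampling of O(2).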
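The group average can be approximated with a finite set of rotations and flips. The sketch below assumes that a transformation made of a flip followed by a rotation of angle α maps the orientation θ to det·θ + α, so that each prediction on a transformed imagelet is mapped back before averaging. The function `group_average` and its arguments are hypothetical names; for testing purposes the "imagelet" is replaced by a toy stand-in (its own orientation angle) and the estimator by a function with an orientation-dependent error.

```python
import math

def wrap(theta):
    """Map an angle (radians) to [-pi/2, pi/2), i.e. modulo pi."""
    return (theta + math.pi / 2) % math.pi - math.pi / 2

def projective_mean(angles):
    """Average of pi-periodic angles via the doubled-angle vector average."""
    x = sum(math.cos(2 * a) for a in angles)
    y = sum(math.sin(2 * a) for a in angles)
    return wrap(0.5 * math.atan2(y, x))

def group_average(f, act, image, k):
    """Approximate the O(2)-average of estimator f with k equi-spaced
    rotations, each taken with and without a flip (2k evaluations)."""
    estimates = []
    for flip in (False, True):
        det = -1.0 if flip else 1.0
        for j in range(k):
            alpha = 2.0 * math.pi * j / k
            prediction = f(act(image, alpha, flip))
            # Undo the transformation on the predicted angle.
            estimates.append(wrap(det * (prediction - alpha)))
    return projective_mean(estimates)

# Toy stand-ins (assumptions for illustration only).
def toy_act(theta, alpha, flip):
    """Action of a flip followed by a rotation on a toy 'imagelet' (an angle)."""
    return wrap((-theta if flip else theta) + alpha)

def toy_estimator(theta):
    """Estimator with an orientation-dependent, pi-periodic error."""
    return wrap(theta + 0.1 * math.sin(2 * theta) + 0.05 * math.cos(2 * theta))

theta = math.radians(25.0)
print("raw error (rad):", abs(wrap(toy_estimator(theta) - theta)))
print("averaged error (rad):", abs(wrap(group_average(toy_estimator, toy_act, theta, 8) - theta)))
```

With real imagelets, `act` would rotate and flip the depth image and `f` would be the trained network; the averaging step itself is unchanged.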
Figure 4: (a-c) Velocity direction and shoulder orientation signals for three trajectories collected in real-life conditions (depth map sequences similar to Fig. 2 are on the right of each panel). We report the instantaneous values of velocity direction (obtained from tracking) and orientation, $\theta_v$ and $\hat\theta$, and the continuous orientation signal $\theta(t)$ (low-pass filter of $\hat\theta$). Orientation has been computed via our CNN trained on 30 hours of real-life velocity data. Panel (a) reports a typical pedestrian behavior, where $\theta(t)$ and $\theta_v(t)$ oscillate “in sync” (at a frequency of about 1 Hz) following the stepping. Our tool also correctly resolves rare sidestepping events or the orientation of standing individuals, in which the signals are out of sync. Panel (b) shows a pedestrian rotating their body, possibly observing their surroundings, while maintaining the walking direction. Panel (c) shows an individual initially standing, then performing a 150° body rotation and finally walking away. In this case, the velocity direction $\theta_v$ is undefined for time t < 10 s as there is no position variation (hence the high noise in $\theta_v$; cf. the yt-diagram, in which the spatial coordinates are constant in the horizontal segment of the trajectory). (d-f) Prediction performance of the network, f (cf. (1)), in case of artificial imagelets (d) and real-life data (e). We train with datasets of increasing size (N, x-axis). We report the Root Mean Square Error of the predictions averaged over M = 32 independent trainings of the networks (ARMSE, (11)) and, in the inset, the average bias, $\hat b$ (cf. (10)). The test sets used to compute the indicators include, for (d), 25k unseen synthetic images with error-free annotation ($\theta$) and, for (e), 25k unseen real-life imagelets, annotated considering low-pass filtered high-resolution orientation estimates, $\theta(t)$, obtained with our neural network trained with 1 M samples and O(2)-group averaging. The bias, in both cases, decreases rapidly below 1°. The ARMSE for the networks trained with the largest dataset approaches, respectively, about 5° and 11° as N grows. For both ARMSE and bias, we report the fitted exponents characterizing the error convergence in the plot labels. We complement the evaluation of the ARMSE considering noisy labels ($\theta + \epsilon$ for case (d) and $\theta_v$ for case (e)). In case (d) the ARMSE saturates consistently with the level of noise in the labels (cf. SI). In case (e), the ARMSE approaches a saturation point at about 20°. This reflects the random disagreement between velocity and orientation. (f) Performance can be further increased by enforcing O(2)-symmetry of the orientation estimator, map $\tilde f$, (8). In panel (f) we consider maps $\tilde f$ built from the networks trained with the largest training datasets from (d,e) (N = 0.5 M), both for the synthetic and real-life cases, vs. the number of samples used for the group average, k. We consider both uniform and random sampling of O(2) (superscripts U and R, respectively). The group average further reduces the ARMSE from about 5° to about 4° in case of synthetic imagelets (no observable difference between uniform and random sampling), and from about 11° to about 7.5° in case of real-life imagelets, with higher performance in case of random sampling for k < 16.
We consider two types of training/testing imagelets: algorithmically generated, “synthetic”, imagelets, of which the orientation angle is known, and real-life imagelets. In the first case we mimic a velocity-based training by adding a centered noise to labels known exactly (following (5)). In the second case, as we have no manually annotated ground truth, of which the accuracy would nevertheless be debatable, we propose a validation based on the convergence towards low-pass filtered orientation signals. In both cases, we show that the average prediction error (ARMSE, (11)) is about 7
degrees or, possibly, lower, should the training set size N be large enough. Specifically, the datasets are as follows:
Synthetic dataset. We generate synthetic imagelets mimicking the overhead shape of people in terms of a superposition of two ellipses: one for the body/shoulder and another one, at lower depth values (i.e. higher above the ground), for the head. We report examples of such imagelets in Fig. 3, while the details of the generation algorithm are left to the SI.
By construction, the rotation angle of the body/shoulder ellipse represents the pedestrian orientation, $\theta$, i.e. it is the ground truth for the training. We train the network with such synthetic imagelets and a small centered Gaussian noise $\epsilon \sim \mathcal{N}(0°, 18°)$ superimposed on $\theta$ to imitate velocity-based training. Hence, we train using labels $\theta + \epsilon$ while we validate with $\theta$ (cf. Eq. 5).
Real-life dataset. We consider depth images and velocity data from a real-life measurement campaign conducted during a city-wide festival (GLOW) in Eindhoven, The Netherlands, in Nov. 2017. The measurements involve a uni-directional crowd flow passing through a corridor-shaped exhibit (tracking area: 12), for further details see [3]. The dataset relies on high-resolution individual localization and tracking based on overhead depth images (as in Fig. 1), with 30 Hz time sampling. The localization and tracking algorithms employed are analogous to those employed in previous works [4, 5]. To ensure that our velocity data provide a well-defined proxy for orientation, we restrict to pedestrians having average velocity above 0.65 m/s. Moreover, for each trajectory we extract imagelets and velocity data with a time sampling of ∆ 5 s, which increases the independence between training data. Additionally, we apply random rotations and random horizontal flips to all imagelets (and, correspondingly, to labels). This aims at training with a dataset uniformly distributed on $P^1(\mathbb{R})$.
In the absence of ground truth, we build our test set as follows: we rely on our neural network trained with 1 M different imagelets (i.e. twice as many as in the largest training dataset considered in Fig. 4(d,e), on which we perform random augmentation and final O(2)-averaging of the operator), hence the most accurate, to make orientation predictions over complete pedestrian trajectories. As an orientation signal $\theta(t)$ needs to be continuous in time, we smooth the predicted $\hat\theta(t)$ in time (low-pass Butterworth filter [15] of order n = 1, cutoff frequency $f_c$ = 2.0 Hz and window length l = 52) to eliminate random noise.
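The smoothing step can be sketched with a first-order Butterworth low-pass filter obtained via the bilinear transform. This is a minimal illustration under several assumptions: the filter is applied in a single causal pass, the window length mentioned above is not modeled, and the π-periodicity of the orientation is handled by filtering the cosine and sine of the doubled angle (a choice made here for illustration, not necessarily the one used for the measurements). The function names are hypothetical.

```python
import math
import random

def wrap(theta):
    """Map an angle (radians) to [-pi/2, pi/2), i.e. modulo pi."""
    return (theta + math.pi / 2) % math.pi - math.pi / 2

def butter1_lowpass(x, fc, fs):
    """First-order Butterworth low-pass (bilinear transform with pre-warping).
    Difference equation: y[n] = b*(x[n] + x[n-1]) - a*y[n-1]."""
    k = math.tan(math.pi * fc / fs)
    b = k / (1.0 + k)
    a = (k - 1.0) / (k + 1.0)
    y = [x[0]]  # start from the steady state of a constant input x[0]
    for n in range(1, len(x)):
        y.append(b * (x[n] + x[n - 1]) - a * y[-1])
    return y

def smooth_orientation(theta, fc=2.0, fs=30.0):
    """Low-pass filter a pi-periodic angular signal sampled at fs (Hz)."""
    c = butter1_lowpass([math.cos(2 * t) for t in theta], fc, fs)
    s = butter1_lowpass([math.sin(2 * t) for t in theta], fc, fs)
    return [wrap(0.5 * math.atan2(si, ci)) for si, ci in zip(s, c)]

def rms_deviation(theta, ref):
    """Root-mean-square wrapped deviation of a signal from a reference angle."""
    return math.sqrt(sum(wrap(t - ref) ** 2 for t in theta) / len(theta))

# Noisy estimates around 88 degrees, i.e. close to the wrap-around boundary.
rng = random.Random(1)
ref = math.radians(88.0)
raw = [wrap(ref + rng.gauss(0.0, math.radians(5.0))) for _ in range(600)]
smooth = smooth_orientation(raw)
print("raw RMS (deg):", math.degrees(rms_deviation(raw, ref)))
print("smoothed RMS (deg):", math.degrees(rms_deviation(smooth, ref)))
```

Filtering the doubled-angle components avoids spurious jumps when the raw estimates cross the boundary between −90° and 90°.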
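We assess the prediction performance as the training set size, N, increases. To compute exhaustive performance statistics, for every N, we train the network on M independent datasets. Given a reference orientation (e.g. ground truth), $\theta_I$, for imagelet I, and the estimator $f_j$ trained on dataset $D_j$ (j = 1, 2, . . . , M), we consider two statistical indicators: (S1) the average prediction bias, $\hat b$, evaluated as the root-mean-square (among the M networks) of the average error
$$ \hat b = \sqrt{\frac{1}{M}\sum_{j=1}^{M} \Big\langle f_j(I) - \theta_I \Big\rangle_I^{2}}\,; \tag{10} $$
(S2) the average root-mean-square error (ARMSE), i.e. the average (on the M networks) of the individual RMS error
$$ \mathrm{ARMSE} = \frac{1}{M}\sum_{j=1}^{M} \sqrt{\Big\langle \big(f_j(I) - \theta_I\big)^{2} \Big\rangle_I}\,, \tag{11} $$
where $\langle\cdot\rangle_I$ denotes the average over the test imagelets and angular differences are taken modulo $\pi$.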
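The two indicators can be computed as in the following sketch, which assumes that the predictions of the M networks on a common test set are available as lists of angles in radians and that angular differences are wrapped modulo π. The function name `bias_and_armse` and the data layout are hypothetical.

```python
import math

def wrap(theta):
    """Map an angle (radians) to [-pi/2, pi/2), i.e. modulo pi."""
    return (theta + math.pi / 2) % math.pi - math.pi / 2

def bias_and_armse(predictions, references):
    """predictions[j][i]: estimate of network j for test imagelet i.
    references[i]: reference orientation of test imagelet i.
    Returns (bias, armse): the RMS over networks of the mean error, and
    the mean over networks of the RMS error."""
    mean_errors, rms_errors = [], []
    for preds in predictions:
        errs = [wrap(p - r) for p, r in zip(preds, references)]
        mean_errors.append(sum(errs) / len(errs))
        rms_errors.append(math.sqrt(sum(e * e for e in errs) / len(errs)))
    m = len(predictions)
    bias = math.sqrt(sum(b * b for b in mean_errors) / m)
    armse = sum(rms_errors) / m
    return bias, armse

# Two toy "networks": one with a constant offset, one with alternating errors.
references = [0.0, 0.5, -0.5, 1.0]
net_offset = [r + 0.1 for r in references]
net_alternating = [r + e for r, e in zip(references, [0.1, -0.1, 0.1, -0.1])]
bias, armse = bias_and_armse([net_offset, net_alternating], references)
print("bias:", bias, "ARMSE:", armse)
```

In this toy case both networks have the same RMS error, but only the first one contributes to the bias, which is why the two indicators are reported separately.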
In Fig. 4(a-c), we report the orientation signals as estimated by the networks in three different real-life contexts. The network is capable of accurate predictions that, as expected, are independent of the actual instantaneous velocity. Hence, it remains accurate in case of a pedestrian walking sideways (Fig. 4(b)), in which the orientation signal temporarily loses its coupling with the velocity orientation, and in case of a pedestrian temporarily stopping and standing (Fig. 4(c)), in which the velocity orientation is even undefined (note that these cases were excluded from the training).
We include in Fig. 4(d-f) the values of the average prediction bias and of the ARMSE as the training set size increases, in case of synthetic and real-life imagelets (respectively, in panels (d) and (e)). In both cases the network performance increases with N, with a slightly slower convergence rate of the ARMSE for the real-life dataset, which is likely more challenging to learn than the synthetic one. In both cases the predictions are free of bias (cf. sub-panels). With the largest number of training imagelets considered (cf. Fig. 4(d,e)), we measured an ARMSE of about 5° for the synthetic data and 11° for the real-life data. We managed to further reduce this error to, respectively, about 4° and 7.5° by enforcing O(2) symmetry. Note that we could trivially apply (8) as we are in a bias-free context; otherwise, a systematic correction for the bias would have been necessary. In Fig. 4(f), we report the network performance as we approximate the O(2) group average with increasing accuracy.
Figure 5: Probability distribution function of the delay time between the shoulder orientation, $\theta(t)$, and velocity orientation, $\theta_v(t)$, signals for different average velocities, $\hat v$. As the average velocity grows, the average delay and the delay fluctuations reduce. The inset reports the ratio between the standard deviation, $\sigma_d$, and the average, $\hat d$, of the delay as a function of $\hat v$. Measurements from 78k trajectories (20 hrs of data, restricted to trajectories whose orientation does not exceed a maximum threshold), acquired during the GLOW event [3].
Figure 6: Comparison between simulations (red dots) and real-life measurements (blue dotted line). We build velocity direction signals $\theta_v(t)$ on top of delayed orientation measurements $\theta(t - d(t))$, where the delay d(t) is modelled by an OU random process (cf. (12), (13)). Measurements have been acquired during the GLOW event (36k trajectories, restricting to people keeping a normal average velocity $\hat v$ of about 1.3 m/s). In (a) we report the probability distribution function (pdf) of the difference between velocity direction and orientation shifted in time by the average delay, $\hat d$ = 0.1 s. In (b), analogously to Fig. 5, we report a pdf of the delay time between $\theta(t)$ and $\theta_v(t)$. The insets in (a,b) report the data in semi-logarithmic scale. For both these quantities we observe excellent agreement between simulations and measurements. Panel (c) shows the grand average power spectral density (psd) of the signals $\theta(t)$ and $\theta_v(t)$. Our model modifies the psd only at high frequencies. As an effect, the most energetic components of the velocity orientation, around 0.2 Hz and 1 Hz, remain, respectively, slightly under- and slightly over-represented.
We are now capable of investigating at high resolution, and in real-life conditions, the connection between shoulder orientation and velocity direction, which, in the previous sections, we reduced to the error term $\epsilon$. In particular, we can characterize a stochastic delay signal, d(t), which allows us to model the relation between velocity direction and orientation as
$$ \theta_v(t) = A\,\theta\big(t - d(t)\big), \tag{12} $$
where A is a positive constant.
First, thanks to the high accuracy of our tool, we measure a velocity-dependent delay between velocity orientation and shoulder orientation, whose probability distribution function is reported in Fig. 5 (see SI for details on the delay measurement algorithm). The velocity orientation follows the shoulder yawing in time, with a delay that decreases (on average) from 160 ms to 100 ms as the average walking velocity, $\bar v$, increases from 0.6 m/s to 1.4 m/s (respectively, walking speed values in leisure and normal walking regimes, see, e.g., [16]).
The structure of d(t) appears well-modeled by an OU process:
$$ \dot d(t) = -\frac{d(t) - \hat d}{\tau} + \sigma\, \dot W, \tag{13} $$
where $\hat d > 0$ is the average delay, $\tau > 0$ is the OU time-scale and $\sigma > 0$ is the intensity of the $\delta$-correlated white noise $\dot W$. In particular, in Fig. 6 we compare statistical observables of measurements and simulations considering the case of normal walking speed (average velocity $\hat v$ of about 1.3 m/s), of which we retain the measured orientation signals, $\theta(t)$, as a basis for (12) (simulation parameters: A = 1.85, $\hat d$ = 0.08 s, $\tau$ = 1.2 s and $\sigma$ = 1.85). In Fig. 6(a), we report the pdf of the difference between orientation and velocity orientation when one is shifted in time by $\hat d$, to compensate for the average delay. Measurements and simulations, in excellent mutual agreement, follow Gaussian statistics. Thanks to the stochastic delay, we achieve a very good quantitative agreement in the delay distributions (Fig. 6(b)). In Fig. 6(c), we report the Power Spectral Density (psd) of $\theta(t)$ and $\theta_v(t)$, computed by averaging all the psds obtained from individual velocity direction and orientation signals. We observe that the stochastic delay does not substantially modify the psd of orientations, especially at low frequencies. As an effect, the peak around f = 1 Hz, connected with the walking fluctuations, remains slightly underestimated, while larger scale fluctuations (f < 1 Hz), effectively not modeled by (13), are over-represented.
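A delayed signal of this kind can be simulated with an Euler-Maruyama discretization of an OU process. The sketch below assumes the standard OU form (relaxation towards the mean delay with time-scale τ, plus white noise of intensity σ). The mean delay and time-scale follow the values quoted above, whereas the noise intensity used here (0.03, in units of s/√s) is an illustrative assumption, since the normalization of the quoted intensity is not specified. The function names are hypothetical.

```python
import math
import random

def simulate_ou_delay(d_mean, tau, sigma, dt, n_steps, rng):
    """Euler-Maruyama integration of  d' = -(d - d_mean)/tau + sigma * W'."""
    d = d_mean
    delays = []
    for _ in range(n_steps):
        d += -(d - d_mean) / tau * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        delays.append(d)
    return delays

def delayed_signal(theta, delays, dt, amplitude=1.0):
    """Build theta_v[n] = A * theta(t_n - d_n) by nearest-sample lookup.
    Indices falling outside the signal are clamped to its ends."""
    out = []
    for n, d in enumerate(delays):
        k = n - int(round(d / dt))
        k = min(max(k, 0), len(theta) - 1)
        out.append(amplitude * theta[k])
    return out

rng = random.Random(0)
dt = 1.0 / 30.0  # 30 Hz sampling
delays = simulate_ou_delay(d_mean=0.08, tau=1.2, sigma=0.03, dt=dt, n_steps=60000, rng=rng)
mean_delay = sum(delays) / len(delays)
print("mean simulated delay (s):", mean_delay)

# A toy orientation signal oscillating at 1 Hz, and its delayed counterpart.
theta = [0.2 * math.sin(2 * math.pi * 1.0 * n * dt) for n in range(len(delays))]
theta_v = delayed_signal(theta, delays, dt, amplitude=1.85)
```

For this discretization the stationary standard deviation of the delay is close to σ·√(τ/2), which gives a quick sanity check on the simulated series.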
In this paper we presented an extremely accurate estimator for the pedestrian shoulder-line orientation based on deep convolutional neural networks. We leveraged statistical aspects of pedestrian dynamics to overcome two outstanding issues related to the training of deep networks: the labor-intensive annotation of training data in sufficient amounts (generally millions of images) and the accuracy of annotations in non-trivial contexts.
Thanks to the strong statistical correlation between shoulder-line and velocity direction, which are typically orthogonal, we can employ the velocity direction as a training label. Although often slightly incorrect, it remains correct on average, and it is to this average that our point-estimator converges. Notably, the relation between velocity and orientation holds regardless of the quality of the raw imaging data employed. In case of overhead depth maps, as used here, human annotators often disagreed, which would likely have yielded low-quality labels. By using velocity we can circumvent this issue and produce training data in arbitrarily large amounts. It should also be stressed that this approach can be conceptually extended to other imaging formats, such as color images, provided accurate and sufficiently prolonged tracking data are available.
Our tool unlocked the possibility to accurately investigate the relation between velocity direction and shoulder orientation. We could measure a velocity-dependent delay of about 100 ms between the former and the latter, which we are able to quantitatively reproduce in terms of a simple Ornstein-Uhlenbeck process. In particular, on the basis of measured orientation signals, we could generate velocity directions whose amplitude with respect to the orientation signal, velocity-orientation delay distribution and power spectral density are in very good agreement with the measurements.
Our velocity-trained network could be possibly employed to investigate conditions of static crowds, clogged bottlenecks conditions, or other scenarios in which the “nematic” ordering of the crowd is expected to play a key role in the dynamics.
A.C. acknowledges the Talent Scheme (Veni) research programme (project N. 16771) financed by the Netherlands Organization for Scientific Research.
[1] H. Pontzer, J. H. Holloway, D. A. Raichlen, and D. E. Lieberman, “Control and function of arm swing in human walking and running,” J. Exp. Biol., vol. 212, no. 4, pp. 523–534, 2009.
[2] Microsoft Corp., “Kinect for Xbox 360,” 2012. Redmond, WA, USA.
[3] A. Corbetta, W. Kroneman, M. Donners, A. Haans, P. Ross, M. Trouwborst, S. Van de Wijdeven, M. Hultermans, D. Sekulovski, F. van der Heijden, et al., “A large-scale real-life crowd steering experiment via arrow-like stimuli,” Pedestrian and Evacuation Dynamics 2018, to appear. arXiv preprint arXiv:1806.09801, 2018.
[4] A. Corbetta, C. Lee, R. Benzi, A. Muntean, and F. Toschi, “Fluctuations around mean walking behaviours in diluted pedestrian flows,” Phys. Rev. E, vol. 95, p. 032316, 2017.
[5] A. Corbetta, J. Meeusen, C. Lee, R. Benzi, and F. Toschi, “Physics-based modeling and data representation of pairwise interactions among pedestrians,” Phys. Rev. E, vol. 98, p. 062310, Dec 2018.
[6] H. Yamamoto, D. Yanagisawa, C. Feliciani, and K. Nishinari, “Body-rotation behavior of pedestrians for collision avoidance in passing and cross flow,” Transport. Res. B-Meth., vol. 122, pp. 486–510, 2019.
[7] C. Feliciani and K. Nishinari, “Pedestrians rotation measurement in bidirectional streams,” in Pedestrian and Evacuation Dynamics 2016, pp. 12–1–12–9, University of Science and Technology of China press, 2016.
[8] M. Boltes and A. Seyfried, “Collecting pedestrian trajectories,” Neurocomputing, vol. 100, pp. 127–133, 2013.
[9] S. Seer, N. Brändle, and C. Ratti, “Kinects and human kinetics: A new approach for studying pedestrian behavior,” Transport. Res. C-Emer., vol. 48, pp. 212–228, 2014.
[10] D. Brščić, T. Kanda, T. Ikeda, and T. Miyashita, “Person tracking in large public spaces using 3-d range sensors,” IEEE Trans. Human-Mach. Syst., vol. 43, no. 6, pp. 522–534, 2013.
[11] W. Kroneman, A. Corbetta, and F. Toschi, “Accurate pedestrian localization in overhead depth images via height-augmented HOG,” Pedestrian and Evacuation Dynamics 2018, to appear. arXiv preprint arXiv:1805.12510, 2018.
[12] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
[13] T. Gowers, J. Barrow-Green, and I. Leader, The Princeton companion to mathematics. Princeton University Press, 2008.
[14] L. C. Grove, Classical groups and geometric algebra, vol. 39. American Mathematical Soc., 2002.
[15] I. Selesnick and C. Burrus, “Generalized digital butterworth filter design,” IEEE Trans. Signal Process., vol. 46, 05 1998.
[16] J. J. Fruin, Pedestrian Planning and Design. Elevator World Inc., 1987.
[17] S. R. Jammalamadaka and A. SenGupta, Topics in Circular Statistics, vol. 5. World Scientific, 2001.
[18] D. J. C. MacKay, Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[19] A. Corbetta, V. Menkovski, and F. Toschi, “Weakly supervised training of deep convolutional neural networks for overhead pedestrian localization in depth fields,” in 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6, IEEE, 2017.
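Arithmetic of angles on $P^1(\mathbb{R})$
We choose to parametrize the projective line, $P^1(\mathbb{R})$, with the interval $[-\pi/2, \pi/2)$. An angular value, $\theta \in \mathbb{R}$, is mapped to this parametrization of $P^1(\mathbb{R})$ through the wrap function, defined as
$$ \mathrm{wrap}(\theta) = \left[\left(\theta + \tfrac{\pi}{2}\right) \bmod \pi\right] - \tfrac{\pi}{2}. \tag{15} $$
Weighted averaging operations on $P^1(\mathbb{R})$ (e.g. (2) and (8)) are computed in this parametrization as
$$ \mathbb{E}_{P^1(\mathbb{R})}[p] = \tfrac{1}{2}\,\mathrm{arctan2}\!\left(\int \sin(2\theta)\,p(\theta)\,d\theta,\ \int \cos(2\theta)\,p(\theta)\,d\theta\right), \tag{16} $$
where $p(\theta)$ is a probability density function on $P^1(\mathbb{R})$. Namely, angles $\theta$ are converted to corresponding Cartesian points on the unit circle (i.e. $(x, y) = (\cos(2\theta), \sin(2\theta))$), then a vector average weighted by $p(\theta)$ is performed, and the final result is mapped back to $P^1(\mathbb{R})$ via $\tfrac{1}{2}\,\mathrm{arctan2}(y, x)$. Note that (16) is not defined whenever the vector average vanishes. This happens, for instance, when $p(\theta)$ is uniform. For further details on the arithmetic of periodic variables, we refer to [17].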
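The wrap function and the weighted average on the projective line can be sketched as follows, assuming the parametrization on [−π/2, π/2) and angles in radians; the function names are hypothetical and the integral is replaced by a finite weighted sum.

```python
import math

def wrap(theta):
    """Map an angle (radians) to the interval [-pi/2, pi/2)."""
    return (theta + math.pi / 2) % math.pi - math.pi / 2

def projective_mean(angles, weights=None):
    """Weighted average of pi-periodic angles.
    Angles are mapped to points (cos 2*theta, sin 2*theta) on the unit circle,
    averaged as vectors, and mapped back via half of atan2."""
    if weights is None:
        weights = [1.0] * len(angles)
    x = sum(w * math.cos(2 * a) for a, w in zip(angles, weights))
    y = sum(w * math.sin(2 * a) for a, w in zip(angles, weights))
    if math.hypot(x, y) < 1e-12:
        raise ValueError("average undefined: the vector average vanishes")
    return wrap(0.5 * math.atan2(y, x))

# 85 and -85 degrees are only 10 degrees apart on the projective line:
# their average lies at the +/-90 degree boundary, not at 0 degrees.
m = projective_mean([math.radians(85.0), math.radians(-85.0)])
print("projective mean (deg):", math.degrees(m))
```

A plain arithmetic mean of the same two angles would return 0°, i.e. the direction orthogonal to both, which illustrates why the periodic average is needed.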
“Two-hot” encoding
Our neural network outputs a discrete probability distribution over B = 45 classes. We interpret it as a probability over $[-\pi/2, \pi/2)$ once partitioned in B equal adjacent intervals of size $\Delta = \pi/B$ and centered around the mid-values $\theta_i = -\pi/2 + (i-1)\Delta + \Delta/2$, with $i = 1, \dots, B$. Let $p = (p(\theta_1), \dots, p(\theta_B))$ be the considered probability distribution. In support of the required circular properties, we enforce adjacency of the outer bins, such that they “wrap” around $P^1(\mathbb{R})$, and $p(\theta_i) = p(\theta_{i+B})$ for all $i$ holds.
In the continuous case, $B \to \infty$, a ground truth annotation (or the ideal output of the neural network) is a Dirac delta probability distribution, $\delta_\theta(\cdot)$, centered on the true angle $\theta$. From this, $\theta$ can be recovered via the expectation in (16): $\theta = \mathbb{E}_{P^1(\mathbb{R})}[\delta_\theta]$. For finite B, we prevent quantization errors by unambiguously encoding angles in (up to) two adjacent bins, from which the name “two-hot” encoding (vs. the “one-hot” encoding in standard classification problems). In particular we encode the angle $\theta$ as
$$ \left(\delta^B_\theta\right)_i = \frac{1}{N}\max\left(0,\ \Delta - |d_i|\right), \qquad d_i = \mathrm{wrap}(\theta - \theta_i), $$
with N being a normalization constant and $d_i$ being the wrapped distance (cf. (15)) of $\theta$ and the mid-angle of the i-th bin. Note that the wrapped difference ensures that both $\theta + \pi/2$ and $\theta - \pi/2$ are encoded into the same distributions, i.e. the network output remains unchanged for a 180° body rotation. By applying (16), we recover the annotation angle $\theta$.
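A possible implementation of the encoding and of its inverse is sketched below. It assumes weights that decrease linearly with the wrapped distance from the bin centers (the exact weighting used for the measurements may differ) and zero-based bin indices; all function names are hypothetical. With linear weights, decoding through the circular average recovers the encoded angle up to a very small interpolation error.

```python
import math

B = 45                  # number of angular bins
DELTA = math.pi / B     # bin width (4 degrees)

def wrap(theta):
    """Map an angle (radians) to [-pi/2, pi/2), i.e. modulo pi."""
    return (theta + math.pi / 2) % math.pi - math.pi / 2

def bin_center(i):
    """Mid-angle of bin i (zero-based, i = 0 .. B-1)."""
    return -math.pi / 2 + (i + 0.5) * DELTA

def two_hot(theta):
    """Encode an angle as a distribution on (up to) two adjacent bins,
    with weights decreasing linearly with the wrapped distance."""
    p = [max(0.0, 1.0 - abs(wrap(theta - bin_center(i))) / DELTA) for i in range(B)]
    total = sum(p)
    return [v / total for v in p]

def decode(p):
    """Circular (projective) average of a distribution over the bins."""
    x = sum(v * math.cos(2 * bin_center(i)) for i, v in enumerate(p))
    y = sum(v * math.sin(2 * bin_center(i)) for i, v in enumerate(p))
    return wrap(0.5 * math.atan2(y, x))

theta = math.radians(37.3)
p = two_hot(theta)
print("non-zero bins:", [(i, round(v, 3)) for i, v in enumerate(p) if v > 1e-9])
print("decoded angle (deg):", math.degrees(decode(p)))
```

Because distances are wrapped, an angle just above −90° shares its weight between the first and the last bin, which is what makes the outer bins adjacent.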
Variance of and cross-entropy based error averaging
Figure 7: Dispersion (standard deviation) in the neural network output probability distribution $p(\theta)$ versus the amplitude (standard deviation) of the error term, $\sigma(\epsilon)$ (between training annotations and ground-truth orientation in accordance with (5)), for synthetic imagelets. We train M = 16 networks for each $\sigma(\epsilon) \in \{\ldots, 20\}$ and evaluate $\sigma(p)$ by averaging the standard deviations of the outputs $p$ among the M networks on 10k synthetic test imagelets. These measurements are reported as blue dotted lines. We observe good scaling agreement between $\sigma(p)$ and the theoretical lower limit $\sigma(\epsilon)$ (cf. proof in the text for the simplified case), indicating that our training converges to the average annotation for similar imagelets.
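where $\hat p$ is the (discretized) probability distribution of the training data. Notice that, by construction, the $\hat p$-average of the annotations is exactly the ground-truth orientation. Proof. Let $p$ have values $p(\theta_i)$ on the B bins. The average loss, $L$, over $n$ one-hot annotations, the $j$-th of which falls in bin $a_j$, reads
$$L = -\frac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{B} \delta_{i,a_j}\,\log p(\theta_i) = -\frac{1}{n}\sum_{j=1}^{n} \log p(\theta_{a_j}), \qquad (21)$$
where the last equality follows from the definition of Dirac mass. We can sort and aggregate the elements in the sum in (21). In particular, letting $\#_i$ be the total number of annotations having value $\theta_i$, (21) yields:
$$L = -\sum_{i=1}^{B} \frac{\#_i}{n}\,\log p(\theta_i) = -\sum_{i=1}^{B} \hat p(\theta_i)\,\log p(\theta_i).$$
By Gibbs’ inequality (see, e.g., [18]), it holds
$$-\sum_{i=1}^{B} \hat p(\theta_i)\,\log p(\theta_i) \;\geq\; -\sum_{i=1}^{B} \hat p(\theta_i)\,\log \hat p(\theta_i).$$
Therefore, at the absolute minimum for $L$, $p$ is the distribution of the labels, which in our case is $\hat p$, i.e. $p = \hat p$.
Note that for a two-hot encoding the proof is identical, but each annotation is a convex combination of two delta masses.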
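The statement of the proof can be checked numerically. The sketch below is an illustration under assumed data: the annotations are hypothetical bin indices drawn from a rounded Gaussian, and the names `empirical_distribution` and `average_cross_entropy` are ours. It compares the average cross-entropy attained by the empirical label distribution with that of randomly drawn competitor distributions.

```python
import numpy as np

def empirical_distribution(labels, n_bins):
    """Relative frequency of each bin among the annotations."""
    counts = np.bincount(labels, minlength=n_bins)
    return counts / counts.sum()

def average_cross_entropy(p, labels):
    """Average loss of a fixed output p against one-hot labels (bin indices)."""
    return -np.mean(np.log(p[labels]))

rng = np.random.default_rng(0)
B = 45
# Hypothetical annotations for a set of similar imagelets.
labels = np.clip(np.rint(rng.normal(22, 3, size=5000)).astype(int), 0, B - 1)

p_hat = empirical_distribution(labels, B)
loss_at_p_hat = average_cross_entropy(p_hat, labels)

# By Gibbs' inequality no other distribution can achieve a lower loss.
smallest_gap = np.inf
for _ in range(200):
    q = rng.random(B) + 1e-3          # strictly positive competitor
    q /= q.sum()
    gap = average_cross_entropy(q, labels) - loss_at_p_hat
    smallest_gap = min(smallest_gap, gap)

print(f"smallest loss gap over random competitors: {smallest_gap:.4f}")
```

The loss attained at `p_hat` equals the entropy of the label distribution, which is the lower bound appearing in Gibbs' inequality.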
In general, in a sufficiently ample dataset of depth imagelets (and related velocity annotations) acquired in absence of biases compromising the relation (5), we expect to find a large number of similar imagelets, yet with different annotations, which provides a rich sampling of the distribution. Abstracting from the previous proof, we expect the training process to be such that, for each set of similar imagelets, the network learns and outputs the probability distribution of the annotations. We verify this experimentally, by means of synthetic imagelets. In Fig. 7 we compare the amplitude of the symmetric centered error, $\sigma(\epsilon)$, with the standard deviation of the predicted distribution (averaging over a test set of 10,000 images), showing that they correlate closely (see Fig. 7 for the range of error amplitudes over which the agreement holds).
Neural network structure and training
We consider a neural network inspired by the VGG model, whose full structure is shown in Figure 8. We implemented the network using the Keras library.
We train the network by randomly augmenting the training images at the beginning of every epoch. Specifically, we apply random rotations and random horizontal flips to all imagelets, and we transform the associated labels accordingly. This ensures a training dataset uniformly distributed on $[-\pi/2, \pi/2)$. A pre-processing standardization step of the depth intensity is applied individually to all the imagelets.
We employ the Adam optimizer with a batch size of 64, and we retain the model that scores the lowest RMSE over a total of 25 training epochs.
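The augmentation and standardization steps can be sketched as follows. This is a simplified illustration, not the training pipeline of this work: to avoid image interpolation it only uses quarter-turn rotations (`np.rot90`) instead of arbitrary random rotations, and it assumes labels are angles on $[-\pi/2, \pi/2)$ with period $\pi$. The function names `wrap`, `standardize` and `augment` are ours.

```python
import numpy as np

def wrap(theta):
    """Wrap an angle to [-pi/2, pi/2)."""
    return (theta + np.pi / 2) % np.pi - np.pi / 2

def standardize(imagelet):
    """Per-imagelet standardization of the depth intensity."""
    std = imagelet.std()
    return (imagelet - imagelet.mean()) / (std if std > 0 else 1.0)

def augment(imagelet, theta, rng):
    """Random horizontal flip and quarter-turn rotation with label update."""
    if rng.random() < 0.5:
        imagelet = np.fliplr(imagelet)
        theta = -theta                   # a mirror reverses the sign of the angle
    k = int(rng.integers(4))             # number of quarter turns
    imagelet = np.rot90(imagelet, k)
    theta = theta + k * np.pi / 2        # +pi/2 and -pi/2 coincide modulo pi
    return standardize(imagelet), wrap(theta)

rng = np.random.default_rng(0)
image = rng.random((40, 40))             # stand-in for a depth imagelet
augmented, label = augment(image, 0.3, rng)
print(augmented.shape, label)
```

Since the label has period $\pi$, the direction of the quarter turn does not matter for the label update, which keeps the sketch independent of the image axis convention.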
Figure 8: Detailed structure of the neural network. The network is fed with single-channel imagelets (40 × 40 pixels), after which the input data propagates through two stacks of convolution, max-pooling and batch normalization layers for feature extraction. A convolution and batch-normalization layer connects the feature maps with a fully connected layer (ReLU activation function). The final softmax activation yields a probability mass function $p(\theta_i)$ on 45 adjacent equal bins as output of the last layer. The network is trained using cross-entropy as loss function.
Map $\tilde f$ and O(2)-group averaging
In this Section, we deduce identity (8) and prove that the map $\tilde f$ strictly respects the O(2) symmetry (cf. (7)). The identity between (8) and (9) can be proved by substitution, considering the fact that O(2) can be decomposed into rigid rotations and rigid rotations applied after a reflection, from which (26) follows.
We prove that $\tilde f$ respects (7) by addressing separately the case in which Φ does not include a reflection and the case in which Φ does include a reflection.
which, after applying the rotation of , becomes
The order in which mirroring and rotation are applied determines the sign of the rotation: for an angle $\alpha$, a rotation by $\alpha$ applied after a reflection equals the same reflection applied after a rotation by $-\alpha$. By using this fact, we get
which, after applying the transformation , becomes
Hence, by combining (34) and (43), the proposition holds
a random sampling, whose results are reported in Fig. 4.
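The symmetrization by group averaging can be illustrated on a finite subgroup. The sketch below is not the construction of $\tilde f$ used in this work, which samples O(2): it averages an arbitrary estimator `f` only over quarter-turn rotations and one mirror (eight transformations), so that no image interpolation is needed. It assumes angles on $[-\pi/2, \pi/2)$ with period $\pi$; the names `group_average`, `axial_mean` and `wrap` are ours.

```python
import numpy as np

def wrap(theta):
    """Wrap an angle to [-pi/2, pi/2)."""
    return (theta + np.pi / 2) % np.pi - np.pi / 2

def axial_mean(angles):
    """Unweighted doubled-angle circular mean of pi-periodic angles."""
    z = np.mean(np.exp(2j * np.asarray(angles)))
    return wrap(0.5 * np.angle(z))

def group_average(f, imagelet):
    """Average f over quarter turns and reflections of a square imagelet,
    mapping each prediction back to the frame of the original imagelet."""
    preds = []
    for k in range(4):
        shift = k * np.pi / 2
        # Rotation only: undo the rotation on the predicted angle.
        preds.append(f(np.rot90(imagelet, k)) - shift)
        # Reflection followed by rotation: undo the rotation, then the mirror.
        preds.append(shift - f(np.rot90(np.fliplr(imagelet), k)))
    return axial_mean(preds)

# Toy estimator with no built-in symmetry (purely illustrative).
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))

def toy_estimator(img):
    return wrap(float(np.sum(img * W)))

x = rng.random((8, 8))
base = group_average(toy_estimator, x)
rotated = group_average(toy_estimator, np.rot90(x))
mirrored = group_average(toy_estimator, np.fliplr(x))
print(base, rotated, mirrored)
```

Even though `toy_estimator` has no symmetry of its own, the averaged output shifts by $\pi/2$ (modulo $\pi$) when the input is rotated by a quarter turn and changes sign when the input is mirrored.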
Time delay of two signals
Let $\hat x(f)$ and $\hat y(f)$ be the Fourier transforms of the signals $x(t)$ and $y(t)$, respectively. By applying the argument operator, $\arg(\cdot)$, we can compute the corresponding phase as a function of the frequency, i.e. $\phi_x(f)$ and $\phi_y(f)$. This enables us to compute the delay time for each frequency component:
$$\tau(f) = \frac{\phi_x(f) - \phi_y(f)}{2\pi f}.$$
We retain as characteristic delay time between the signals the value $\tau(f^*)$, provided $f^*$ exists, according to the following procedure: considering the frequency range $f \in [0.6, 1.2]$ Hz, where the typical walking fluctuations occur, we compute
$$f_x^* = \arg\max_f E_x(f), \qquad f_y^* = \arg\max_f E_y(f),$$
where $E_x$ and $E_y$ are the energy spectra that can be computed by application of the modulus operator, $\operatorname{abs}(\cdot)$, to $\hat x(f)$ and $\hat y(f)$. We set $f^* = f_x^*$ if $f_x^* = f_y^*$, i.e. velocity and orientation are synchronized. Else, we discard the trajectory from the computation of the delay.
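A possible NumPy realization of this procedure is sketched below. It is an illustration under stated assumptions: the two signals are equally sampled at rate `fs`, the energy spectrum is taken as the squared modulus of the discrete Fourier transform, and a positive delay means that the second signal lags the first (the sign convention is our choice). The function name `characteristic_delay` is hypothetical.

```python
import numpy as np

def characteristic_delay(x, y, fs, band=(0.6, 1.2)):
    """Delay between two equally sampled signals at their common dominant
    frequency inside `band`; returns None if the dominant frequencies differ."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    X, Y = np.fft.rfft(x), np.fft.rfft(y)
    in_band = np.flatnonzero((freqs >= band[0]) & (freqs <= band[1]))
    if in_band.size == 0:
        return None
    # Dominant frequency of each signal from its energy spectrum.
    kx = in_band[np.argmax(np.abs(X[in_band]) ** 2)]
    ky = in_band[np.argmax(np.abs(Y[in_band]) ** 2)]
    if kx != ky:
        return None                      # not synchronized: discard
    # Phase difference arg(X) - arg(Y), wrapped to (-pi, pi].
    dphi = np.angle(X[kx] * np.conj(Y[kx]))
    return dphi / (2.0 * np.pi * freqs[kx])

# Usage on two synthetic 1 Hz signals, the second delayed by 0.1 s.
fs = 20.0
t = np.arange(200) / fs
x = np.sin(2 * np.pi * 1.0 * t)
y = np.sin(2 * np.pi * 1.0 * (t - 0.1))
print(characteristic_delay(x, y, fs))
```

Wrapping the phase difference before dividing by $2\pi f$ limits the recoverable delay to half a period of the dominant frequency, which is a limitation of this sketch.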
Generation of synthetic data
We generate synthetic imagelets mimicking the real-life overhead shape of people as a superposition of two ellipses: one for the body/shoulders and another one, at lower depth values (larger height), for the head (cf. Figure 3(a,b)). We characterize each ellipse by 6 random scalars: the Cartesian coordinates of its center, its area, eccentricity, rotation angle and depth value (i.e. the gray shade in the colorization in Fig. 2), with the index j = b, h denoting body and head, respectively. A bivariate normal random distribution around the imagelet center determines the position of the body ellipse. The head ellipse is superimposed at a uniform random position close to the end of the body ellipse minor axis. We add exceptions that are often seen in real-life data (i.e. perturbations of the overhead elliptical pedestrian shape) by drawing 4 additional ellipses at random positions in the imagelets. Moreover, artifacts of adjacent pedestrians in imagelets due to locally dense situations are represented in the dataset by cropping 9 imagelets from a 3 × 3 grid in which pedestrians are drawn at random positions relative to the grid points. We represent clothing artifacts, shape variations and imperfect depth reconstruction by applying random modifications to each synthetic imagelet (in the same spirit of [19]): these include pixel removal (i.e. replacing 15% of the pixels, chosen at random, with the background pixel value), pixel addition (i.e. replacing 25% with the median pixel value), depth translation (i.e. increasing the values of all foreground pixels by a single uniform random variable $\in [\ldots, 15]$) and Gaussian noise ($\sigma$ = 5). Finally, we smoothen the imagelets by convolving them with a 3 × 3 averaging kernel, resulting in the synthetic imagelets of Figure 3. We report the imagelet generation algorithm in full in Algorithm 1.
Algorithm 1 Algorithm for the generation of synthetic imagelets (cf. Fig. 3). Each iteration of the nested for-loop draws a single pedestrian on a square background at relative distance d to create artifacts of adjacent pedestrians. Random variables (generated in lines 6-19) characterise two ellipses (drawn in lines 20 and 21) that represent the body and head. We introduce a 25% probability of drawing children, represented by smaller ellipses (lines 18 and 19). Additionally, perturbations of the elliptical overhead shape (e.g. due to backpacks, arms or posture) are imitated by drawing 4 small ellipses at random positions in lines 23-29. Finally, 9 imagelets are obtained by cropping around each of the 3 × 3 grid positions (lines 30-36).
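A heavily reduced version of the generation procedure can be sketched as follows. It is not Algorithm 1: it draws a single pedestrian (body and head ellipse), then applies pixel removal, Gaussian noise and the 3 × 3 averaging, and it omits the additional perturbation ellipses, the children, the depth translation, the pixel addition and the 3 × 3 grid cropping. All numerical values for areas, eccentricities, depth levels and position spread are hypothetical placeholders; only the 15% removal rate, the noise amplitude of 5 and the 3 × 3 kernel come from the text. The function names are ours.

```python
import numpy as np

def semi_axes(area, ecc):
    """Semi-axes (a >= b) from area = pi*a*b and eccentricity."""
    ratio = np.sqrt(1.0 - ecc ** 2)              # b / a
    a = np.sqrt(area / (np.pi * ratio))
    return a, a * ratio

def ellipse_mask(shape, center, area, ecc, angle):
    """Boolean mask of a filled ellipse; center is (x, y) in pixels."""
    a, b = semi_axes(area, ecc)
    yy, xx = np.indices(shape)
    dx, dy = xx - center[0], yy - center[1]
    # Coordinates along the major (u) and minor (v) axes.
    u = dx * np.cos(angle) + dy * np.sin(angle)
    v = -dx * np.sin(angle) + dy * np.cos(angle)
    return (u / a) ** 2 + (v / b) ** 2 <= 1.0

def box_blur3(img):
    """3 x 3 averaging kernel with edge replication."""
    padded = np.pad(img, 1, mode="edge")
    rows, cols = img.shape
    return sum(padded[i:i + rows, j:j + cols]
               for i in range(3) for j in range(3)) / 9.0

def synthetic_imagelet(rng, size=40, background=255.0):
    """Return (imagelet, body angle); all shape parameters are placeholders."""
    img = np.full((size, size), background)
    angle = rng.uniform(-np.pi / 2, np.pi / 2)
    center = rng.normal(size / 2, 1.5, size=2)   # body position near the center
    img[ellipse_mask(img.shape, center, 300.0, 0.85, angle)] = 120.0
    # Head: close to the end of the body minor axis, at a lower depth value.
    _, b = semi_axes(300.0, 0.85)
    offset = rng.choice([-1.0, 1.0]) * rng.uniform(0.6, 1.0) * b
    head_center = center + offset * np.array([-np.sin(angle), np.cos(angle)])
    img[ellipse_mask(img.shape, head_center, 60.0, 0.5, angle)] = 80.0
    # Pixel removal: 15% of the pixels are set to the background value.
    img[rng.random(img.shape) < 0.15] = background
    # Gaussian noise with standard deviation 5, then 3 x 3 averaging.
    img = img + rng.normal(0.0, 5.0, size=img.shape)
    return box_blur3(img), angle

imagelet, theta = synthetic_imagelet(np.random.default_rng(0))
print(imagelet.shape, theta)
```

Deriving the semi-axes from area and eccentricity mirrors the parametrization of the ellipses given in the text, so the random scalars can be drawn independently of each other.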