With the recent release of Microsoft HoloLens, a milestone in Augmented Reality headsets, and prospective releases of similar devices from companies such as Magic Leap and Meta, the possibilities of AR technology are becoming more realizable. However, none of the proposed headsets have native capabilities that allow augmenting human-to-human interactions.
With HoloFace, we want to introduce an open-source framework that allows Augmented Reality (AR) developers to localize human faces in 3D and estimate their attributes. This can in turn be used to augment the interaction between the user of the AR headset and other people.
Potential applications range from entertainment, where HoloFace can be used to augment faces seen by the user with 3D models of game characters, to medicine, where HoloFace can be combined with face recognition to automatically show data about patients seen by the doctor. Figure 1 shows an example of HoloFace with the subjects’ faces augmented with items and visual effects.
HoloFace uses the frontal camera of an AR headset to first detect and later track the subject’s face and its landmarks. The landmarks are subsequently used to fit the CANDIDE-3 [1] deformable 3D face model to the detected face. The pose of the fitted model, which consists of translation and rotation, is converted to the pose of the subject’s face in the world coordinate system. This allows for rendering items on or around the subject’s face. The weights of the blendshapes of the fitted model are used to estimate the attributes of the subject’s face, such as whether the subject is smiling or opening their mouth.
Since HoloLens is a battery-powered device, performance is a key factor, and all of the methods that run on the device are chosen with performance in mind. This approach leads to trade-offs between speed and accuracy. To mitigate them we include two face tracking methods: one that is optimized for speed and runs on the device, and one that is more accurate and runs on a remote machine.
HoloFace is implemented using the Unity game engine, with the most computationally expensive elements implemented in C++ as plugins. It is important to note that while currently HoloFace only works with Microsoft HoloLens, it should be easily portable to other Windows Mixed Reality Devices (provided they have a frontal camera), once they become available.
The main contributions of the article are the following:
• a design of a framework that allows for face alignment and 3D head pose estimation on a low-power device such as Microsoft HoloLens,
• a simple and effective method for verifying face tracking failure in neural network based face alignment methods,
• open-source implementation of the proposed framework.
The remainder of this paper is organized as follows: section 2 reviews the related work, section 4 details the methods we have used, section 5 describes the implementation and section 6 explains the measurement of the HoloLens camera latency and verifies the accuracy of the face tracking and failure detection methods.
Copyright IEEE 2018
Figure 1. Examples of faces augmented with animations and overlaid objects using HoloFace. Photographs taken directly through the display of Microsoft HoloLens.
This article is based on previous work in several areas including: augmented reality, face alignment and head pose estimation. For brevity we will only discuss related work in the area of augmenting facial images which is the main topic of this paper.
The augmentation of facial images is a popular application of AR in industry. In the consumer sector, Snapchat and MSQRD have proposed mobile applications that render animations and objects on the image of a user’s face. In the professional sector, FaceRig, Adobe and others offer applications that allow for markerless facial motion capture and automated animation of character models.
Astonishingly, augmenting facial images on mobile devices has received relatively little attention in academia. One of the first works that touched on this topic was [6], where the authors propose a pipeline that uses face tracking and facial recognition on a smartphone. The application augments the faces of people seen through the smartphone’s camera with information about them based on their recognized identity.
One of the downsides of the method proposed in [6] is that the pose of the head is not estimated, which means that the face can only be augmented with flat objects. In contrast, [13] tracks the 3D positions of facial landmarks, which allows for augmenting the face image with 3D effects. More recently, [20] proposed a method in which 2D landmarks are tracked and the head pose is estimated using a Perspective-n-Point (PnP) method, which also allows for rendering 3D effects on the user’s face. In [3] the authors propose quite a different facial augmentation system that displays the augmented content directly on the subject’s face using a projector. The use of specialized hardware allows for very high framerates and visually appealing effects. On the downside, the augmented content can only be displayed on the face itself, not outside of it. None of [13, 20, 3] is intended for mobile platforms.
HoloFace provides the locations of 2D and 3D landmarks (the latter in the form of the vertices of the fitted 3D model) as well as the head pose and facial attributes. Thanks to the nature of the HoloLens display it also allows for displaying content both on and around the subject’s face. Moreover, to the best of our knowledge no other work has previously covered the problem of augmenting faces, and consequently interactions, on Augmented Reality headsets.
Figure 2 shows an outline of the HoloFace framework. In the initial state, when no face is being tracked, the framework performs face detection on the arriving frames. Once a face is detected, its landmarks are localized using one of the two face alignment methods included in the pipeline (details in section 4.1). In the subsequent frames the face is tracked along with its landmarks using the same face alignment method. During tracking the face alignment is initialized based on the landmark locations in the previous frame, taking into account the headset movement (details in section 4.4).
For each frame where the face is tracked the quality of the tracking is verified. Here we again use two different methods, each specific to one of the two face alignment methods (more details in section 4.4). If the quality of the tracking is less than a specified threshold the tracking is considered to have failed and the pipeline goes back to the face detection step.
If the landmarks have been located successfully the 3D head pose and the facial attributes are estimated using the methods described in sections 4.2 and 4.3. This step is followed by the denoising and prediction step which aims to both reduce the jitter and latency of the head pose. The prediction is necessary to compensate for the delay between image acquisition and rendering, more details in section 4.5.
The final head pose is passed to the Unity Engine which can use it to render any object on or around the subject’s face.
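The control flow outlined above can be sketched as a small state machine. This is a minimal illustration only: the names `TrackerState`, `process_frame`, `detect`, `align` and `quality` are hypothetical placeholders, and the compensation for headset movement during initialization is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Landmarks = Sequence[Tuple[float, float]]


@dataclass
class TrackerState:
    # None means that no face is currently tracked (detection state).
    landmarks: Optional[Landmarks] = None


def process_frame(state: TrackerState,
                  frame: object,
                  detect: Callable[[object], Optional[Landmarks]],
                  align: Callable[[object, Landmarks], Landmarks],
                  quality: Callable[[object, Landmarks], float],
                  threshold: float) -> TrackerState:
    """Run one detect -> align -> verify step and return the new state."""
    # While tracking, initialize from the previous landmarks; otherwise detect.
    init = state.landmarks if state.landmarks is not None else detect(frame)
    if init is None:
        return TrackerState(None)  # no face found, keep detecting
    landmarks = align(frame, init)
    if quality(frame, landmarks) < threshold:
        return TrackerState(None)  # tracking failed, go back to detection
    return TrackerState(landmarks)
```

Passing the detector, aligner and quality check as callables mirrors the fact that the local and remote trackers each bring their own alignment and verification method.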
Figure 2. A diagram showing the outline of the HoloFace framework.
For the face detection step of the HoloFace framework we use the face detection method built into the Universal Windows Platform framework. The remaining methods are described below.
4.1. Face alignment
Many of the recently proposed face alignment methods are based on deep neural networks [15, 22, 10]. Unfortunately the computational complexity of deep neural networks limits their application on mobile devices. Most recently the authors of [4] have proposed a method for face alignment based on a binarized convolutional neural network which has the potential to run on a mobile device; however, no tests on an actual device were performed to validate this capability. On the other hand, face alignment methods based on more traditional approaches such as regression trees [14, 17, 12] are fast enough to run on a mobile device, but offer less accurate results.
Face alignment is the most crucial part of our framework that all the other elements are based on. For that reason in HoloFace we implement two state-of-the-art face alignment methods: a method based on regression trees which is capable of running locally on the device [14] and a more powerful method based on deep neural networks which is intended to run on a remote desktop machine [15].
The first method, which we will refer to as KRFWS, is based on the work of Kowalski et al. in [14]. The authors of [14] propose a face alignment pipeline that uses novel K-Cluster Regression Forests with Weighted Splitting.
In HoloFace, KRFWS is intended to perform all of the image processing locally on the HoloLens. Because of that, we had to simplify the method to achieve reasonable processing speed. To that end we only use the base face alignment method without initialization refinement (called APR and 3D-APR in the original article). Moreover, we substitute the Pyramid Histogram of Oriented Gradients (PHOG) [9] features with standard Histogram of Oriented Gradients (HOG) [5] features and reduce their size to pixels per landmark. We also reduce the number of face alignment stages to 3.
The second, alternative, face alignment method we use is Deep Alignment Network (DAN) [15]. Since this recently published method uses convolutional neural networks, it is too expensive to execute it locally on the HoloLens. Because of that, DAN is run on a remote computer equipped with a CUDA enabled GPU for fast processing. The communication between the AR headset and the computer takes place over WiFi. In order to increase the processing speed of DAN we only use a single stage of the neural network, as opposed to two stages employed in the original article. Examples of images with landmarks localized using DAN are shown in Figure 3.
There are many situations where the two approaches can be used in a complementary way. For example, if an application based on HoloFace is mostly used within a given building it can use the more accurate remote backend while on WiFi, and easily switch to the local tracker when out of range.
Section 6 contains a comparison of both methods in terms of tracking accuracy.
4.2. Head pose estimation
Since performance is a key factor in HoloFace, we propose a simple, fast, optimization-based head pose estimation method. The pose of the head is estimated by fitting the CANDIDE-3 [1] deformable 3D face model to the localized landmarks. The CANDIDE-3 model consists of an average 3D face shape $\bar{S}$ as well as a number of shape units and action units, which can be used to deform $\bar{S}$. We refer to the shape and action units together as blendshapes and denote them with $B_1, \dots, B_n$, where $n$ is the total number of blendshapes. The mean shape $\bar{S}$ as well as the blendshapes consist of 113 vertices connected with 168 triangles.
Figure 3. Images from the 300-W dataset [18] with landmarks localized using DAN on the left and with the fitted CANDIDE-3 model on the right.
The fitting is accomplished by solving the problem below using the Gauss-Newton method:
where $s$ are the localized landmarks, $K$ is the intrinsic matrix of the camera mounted on the AR headset, $[R, t]$ are the rotation matrix and translation vector describing the head pose and $w$ are the blendshape weights. Since the set of landmarks localized using face alignment and the set of vertices in CANDIDE-3 might be different, we only use a manually selected subset of landmarks that are common between the two sets.
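One way to write a fitting objective consistent with the definitions above is sketched below. The notation is an assumption made for illustration: $\bar{S}_j$ and $B_{i,j}$ denote the $j$-th vertex of the mean shape and of the $i$-th blendshape, $w_i$ are the blendshape weights, $\mathcal{L}$ is the set of landmarks common to both sets, and $\pi$ is the perspective division. It is not necessarily the exact formulation used by the authors.

```latex
\min_{R,\,t,\,w}\ \sum_{j \in \mathcal{L}}
\left\| s_j - \pi\!\left( K \left( R \Big( \bar{S}_j + \sum_{i=1}^{n} w_i B_{i,j} \Big) + t \right) \right) \right\|_2^2,
\qquad
\pi\big( (x, y, z)^{\top} \big) = \left( x/z,\; y/z \right)^{\top}
```

In this form the residual is a reprojection error in pixels, which is the kind of nonlinear least-squares problem that Gauss-Newton is suited to.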
In order for the obtained pose to be metric we scale the mean shape so that its inter-pupillary distance is equal to that of the average person, 63 mm [7].
The head pose [R, t] obtained from optimization is in the coordinate system of the camera located on the AR headset. In order to render objects on the face we need to obtain its pose in the world coordinate system. To do so it is necessary to know the camera-to-world transform of the headset at the time the image of the face was taken. This transform is easily obtained from the HoloLensForCV API [16].
Given that this transform is known, the position of the head in the world coordinate system is calculated as follows:
where lookAt outputs a rotation that points the forward vector from the head’s position to the camera’s position in the world coordinate system.
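The conversion from camera space to world space can be illustrated as follows. This is a sketch under stated assumptions, not the authors' formulation: the camera-to-world transform is taken to be a 4x4 homogeneous matrix `T_cw`, rotations are composed directly instead of through lookAt, and the `look_at` helper (right-handed, +Z forward, +Y up) only demonstrates the behaviour described in the text. All function names are hypothetical.

```python
import numpy as np


def look_at(src, dst, up=np.array([0.0, 1.0, 0.0])):
    """Rotation matrix whose forward (+Z) column points from src to dst.

    Assumes the direction src -> dst is not parallel to `up`.
    """
    forward = dst - src
    forward = forward / np.linalg.norm(forward)
    right = np.cross(up, forward)
    right = right / np.linalg.norm(right)
    true_up = np.cross(forward, right)
    return np.column_stack([right, true_up, forward])


def head_pose_to_world(R, t, T_cw):
    """Map a head pose [R, t] from camera space to world space.

    T_cw is an assumed 4x4 camera-to-world matrix. Returns the world-space
    rotation, the head position and the camera position.
    """
    t_world = (T_cw @ np.append(t, 1.0))[:3]
    camera_world = T_cw[:3, 3]
    R_world = T_cw[:3, :3] @ R
    return R_world, t_world, camera_world
```

A call such as `look_at(t_world, camera_world)` then yields a rotation that faces the headset camera from the head position.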
Several examples of images with the mesh of the fitted CANDIDE-3 face model overlaid on them are shown in Figure 3.
4.3. Facial attribute estimation
The blendshape weights of the fitted model are used to estimate the facial attributes of the tracked face. If the weight of a given blendshape exceeds a predefined threshold the corresponding attribute is considered to be present. This can be used to trigger animations, recognize emotions, etc. The facial attributes we can recognize with this method are the following: smiling, eyebrow raising, mouth opening.
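The thresholding described above amounts to a per-attribute comparison. In the sketch below the attribute names and threshold values are hypothetical and chosen only for illustration.

```python
def estimate_attributes(weights, thresholds):
    """Mark an attribute as present when its blendshape weight exceeds its threshold.

    weights:    dict mapping attribute name -> fitted blendshape weight
    thresholds: dict mapping attribute name -> predefined threshold
    Attributes without a fitted weight are treated as absent.
    """
    return {name: weights.get(name, 0.0) > thr for name, thr in thresholds.items()}


# Hypothetical example values, not taken from the paper.
example = estimate_attributes(
    {"smile": 0.7, "mouth_open": 0.1},
    {"smile": 0.5, "mouth_open": 0.5, "eyebrow_raise": 0.5},
)
```

The resulting flags can be polled every frame, for instance to start an animation when `mouth_open` becomes true.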
4.4. Face tracking
In most face tracking applications the camera is stationary and for each new frame the tracker is initialized with the facial landmarks from the previous frame. On AR devices the camera is moving together with the user’s head, which can lead to very large differences between the location of the face in the image in consecutive frames.
In order to compensate for the movement of the user’s head we take advantage of the fact that the pose of the HoloLens is known at every moment in time. We use the previously estimated head pose to obtain the 3D world space coordinates of the facial landmarks from the previous frame. We then project them into the image space using the world-to-camera transform of the current frame to obtain the initial landmarks for the tracker. The world space coordinates are computed with the camera-to-world transform from the previous frame. It is important to note that this transform is not the same as the camera-to-world transform for the current frame, as the headset might have changed its position between the moments the frames were taken. The world-to-camera and camera-to-world transforms are obtained from the HoloLensForCV API [16].
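The initialization procedure can be sketched as a chain of transforms. The snippet assumes 4x4 homogeneous matrices for the headset transforms, a pinhole camera with intrinsic matrix `K` and positive depth along +Z; the function and argument names are hypothetical.

```python
import numpy as np


def to_homogeneous(points):
    """Append a column of ones: (N, 3) -> (N, 4)."""
    return np.hstack([points, np.ones((len(points), 1))])


def init_landmarks(model_pts, R, t, T_cw_prev, T_wc_cur, K):
    """Predict 2D landmark positions in the current frame.

    model_pts: (N, 3) vertices of the fitted face model
    R, t:      head pose estimated in the previous camera frame
    T_cw_prev: camera-to-world transform of the previous frame (4x4)
    T_wc_cur:  world-to-camera transform of the current frame (4x4)
    K:         3x3 camera intrinsic matrix
    """
    cam_prev = model_pts @ R.T + t                          # model -> previous camera
    world = (to_homogeneous(cam_prev) @ T_cw_prev.T)[:, :3]  # previous camera -> world
    cam_cur = (to_homogeneous(world) @ T_wc_cur.T)[:, :3]    # world -> current camera
    projected = cam_cur @ K.T                                # apply intrinsics
    return projected[:, :2] / projected[:, 2:3]              # perspective division
```

If the headset has not moved, the two transforms cancel and the previous landmarks are reproduced; otherwise the initial landmarks shift in the image by the apparent motion of the face.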
As a result of the procedure described above, the initialization of the face tracker is invariant to the head movement of the headset user. This greatly decreases the number of times loss of tracking occurs and thus makes the experiences more stable.
It is important to note that loss of tracking may still occur, for example, if the subject’s face moves very quickly or becomes occluded. Because of that it is necessary to detect it and reinitialize the tracker if it occurs. In HoloFace both face alignment methods (KRFWS and DAN, see section 4.1 for details) have their own methods to detect loss of tracking. The decision to use different methods is motivated by the fact that the two face alignment algorithms are based on very different principles.
In KRFWS we detect loss of tracking using a simple method based on the Supervised Descent Method [21]. In this method we first extract HOG features in patches around each localized landmark. The HOG features are subsequently concatenated and passed to a linear model which predicts the Sum of Squared Errors of the landmark positions. If the predicted error exceeds a predefined threshold the tracking is considered to have been lost. In order to save computational resources we use the HOG features extracted in the last stage of KRFWS instead of extracting new features. Thanks to this approach the loss of tracking detection adds very little overhead.
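The linear failure detector described above can be sketched in a few lines. The training routine below uses ordinary least squares, which is an assumption; the text only states that a linear model predicts the Sum of Squared Errors from concatenated HOG features. Function names are hypothetical.

```python
import numpy as np


def fit_error_regressor(features, sse):
    """Least-squares fit of a linear model (with bias) mapping features to SSE.

    features: (num_samples, num_features) concatenated HOG descriptors
    sse:      (num_samples,) sum of squared landmark errors per sample
    """
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X, sse, rcond=None)
    return coef[:-1], coef[-1]


def tracking_lost(hog_per_landmark, weights, bias, threshold):
    """Concatenate per-landmark HOG features and compare the predicted SSE to a threshold."""
    x = np.concatenate([np.ravel(h) for h in hog_per_landmark])
    predicted_sse = float(x @ weights + bias)
    return predicted_sse > threshold, predicted_sse
```

Because the features are reused from the last alignment stage, the run-time cost of this check is a single dot product.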
In DAN the loss of tracking is detected using an additional layer added after the penultimate layer of the original network. The layer consists of two neurons densely connected to the previous layer, followed by a softmax nonlinearity. This layer is trained to recognize whether the input image contains a face or not. Thanks to this approach we can perform training on very large datasets designed for face detection rather than on smaller datasets that have annotated landmark locations. The resulting method is capable of detecting loss of tracking and adds almost no additional computational overhead. It is important to note that this simple method can be easily added to nearly any neural network based face alignment method.
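The additional layer can be illustrated with plain NumPy. This is a sketch of the forward pass only; the weight shapes, the convention that index 1 means "face", and the 0.5 decision threshold are assumptions made for the example.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1D vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def face_probability(penultimate, W, b):
    """Two-neuron dense layer followed by softmax.

    penultimate: (D,) activations of the network's penultimate layer
    W:           (2, D) weights, b: (2,) biases
    Returns the probability assigned to the "face" class (index 1 by assumption).
    """
    return float(softmax(W @ penultimate + b)[1])


def tracking_lost(penultimate, W, b, threshold=0.5):
    """Report loss of tracking when the face probability drops below the threshold."""
    return face_probability(penultimate, W, b) < threshold
```

Since the layer reuses activations that are computed anyway for alignment, the extra cost is one small matrix-vector product per frame.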
Figure 4. Comparison of HoloFace with and without the prediction step, both images are a part of a sequence where the subject’s head is moving from right to left. In the image on the left no prediction is performed and the object clearly lags behind the face, in the image on the right the proposed prediction scheme reduces the delay. The rendered landmarks indicate where the face was at the time of image acquisition.
4.5. Prediction and denoising
One of the key differences between augmenting facial images through a headset and augmenting them on a screen (as in popular smartphone applications), is the influence of latency. In the latter case even if the time between image acquisition and display is significant, the perceived latency is small because all of the augmentation is shown rendered on top of the original image. In case of a headset the user sees the scene and the objects rendered on top of it with his own eyes. This makes any delay between the image acquisition and rendering visible.
For example, if the subject’s head is moving then any objects rendered on top of that head will lag behind. This is caused by the fact that the pose of the rendered objects is based on where the subject’s head was at the time of image acquisition, and in the time between acquisition and rendering the head has moved (for an example see Figure 4). In order to reduce this effect we employ a Kalman filter [11] to predict the location of the subject’s head at rendering time.
The formulation of the Kalman filter we use is similar to the one proposed in [3]. The position of the head is modeled by three Kalman filters, one for each axis. The state vector (x, v, a) consists of the position along the given axis x, velocity v and acceleration a. The only measured quantity is the position. The process transition A, process noise covariance Q and measurement noise covariance R matrices are defined as follows:
where $\Delta t$ is the time step, $\sigma_a$ and $\sigma_m$ are the standard deviations of the acceleration and measurement noise, while $\alpha$ is the decay rate to white noise. Through testing on a prerecorded sequence we determine the optimal values of these parameters. At those values the filters achieve the most accurate prediction of the face location 120 ms into the future. In testing we have found that predicting the location of the face further into the future leads to a larger error, even if the actual delay is higher. This is caused by the predictions overshooting the correct position when the head changes direction.
At runtime $\Delta t$ is set to the time that elapsed since the last measurement. The localization of the face that is passed to Unity Engine for rendering is predicted based on the time elapsed since image acquisition.
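A per-axis filter of this kind can be sketched as follows. The matrices used here are an assumption: a textbook constant-acceleration model whose acceleration decays by a factor `alpha` per step, with process noise injected only into the acceleration. They are not necessarily the matrices used in HoloFace, and the parameter values in the usage note are hypothetical.

```python
import numpy as np

H = np.array([[1.0, 0.0, 0.0]])  # only the position is measured


def transition(dt, alpha):
    """Assumed process transition for the state (x, v, a)."""
    return np.array([[1.0, dt, 0.5 * dt * dt],
                     [0.0, 1.0, dt],
                     [0.0, 0.0, alpha]])


def kalman_step(x, P, z, dt, sigma_a, sigma_m, alpha):
    """One predict + update cycle for a single axis."""
    A = transition(dt, alpha)
    Q = np.diag([0.0, 0.0, sigma_a ** 2])  # assumed process noise covariance
    R = np.array([[sigma_m ** 2]])         # measurement noise covariance
    # Predict the state to the time of the new measurement.
    x = A @ x
    P = A @ P @ A.T + Q
    # Correct with the measured position z.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (np.array([z]) - H @ x)
    P = (np.eye(3) - K @ H) @ P
    return x, P


def predict_position(x, horizon, alpha):
    """Extrapolate the position `horizon` seconds ahead without a measurement."""
    return float((transition(horizon, alpha) @ x)[0])
```

In use, `kalman_step` would be called once per camera frame with `dt` set to the time since the previous measurement, and `predict_position` once per rendered frame with the time elapsed since image acquisition; three independent instances cover the three axes.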
The Kalman filter also serves an additional purpose of reducing the influence of the noise in landmark localization on the head pose. In order to further reduce this influence we employ an additional filtering step in which we average the head pose over the two most recent frames if the head moved by less than 5mm and turned by less than .
HoloFace is implemented in the Unity game engine with the most compute intensive elements, such as local face tracking, implemented in a separate C++ plugin. The use of Unity allows for easy integration of HoloFace with existing software and easy repurposing to other applications.
Please note that, except for the remote facial landmark tracking method, all of the elements of HoloFace run locally on the device. The remote facial landmark tracking method, once enabled, runs for as long as the connection with the server is maintained and switches to the local method when connection is lost. To summarize, if there is no connection to the backend processing server, HoloFace runs entirely on the device.
In our implementation we present an application of HoloFace to entertainment which allows for rendering objects and animations on top of a face seen through the headset. The application is controlled using gestures and voice commands which allow the user to choose the effect being displayed, enable the remote tracker and enter a debug mode which shows the framerate and landmark locations. Thanks to the Unity Engine new objects and animations can be easily added or edited.
Examples of faces augmented with HoloFace are shown in Figures 1 and 5. The images shown in these figures were all taken directly through the display of the HoloLens using a camera attached to the headset; one of the setups we used is shown in Figure 6.
We publish the entire code of HoloFace under the MIT License.
5.1. Datasets and training
We train both the KRFWS and DAN face alignment methods on the 300-W [18] and Menpo [25] datasets, with some additional images from the Multi-PIE [8] dataset for the KRFWS method. For the DAN method we additionally blur some of the input images for training to increase robustness against motion blur. Since the speed of the KRFWS method depends on the number of tracked landmarks we only use the 51 internal landmarks out of the 68 contained in the datasets, rejecting the landmarks on the face edge. The methods used for tracking verification are trained on the 300-W [18] dataset for KRFWS and on the WIDER [23] dataset for DAN.
5.2. Performance
For both tracking methods the framerate of the pipeline during tracking is 30 fps, which is the maximal framerate of the built-in webcam. This includes the latency caused by the network traffic in the DAN face tracker. The framerate drops to 17 fps when no face is tracked and detection is being performed. In order to obtain a high processing speed with the local tracker, we have created a highly optimized implementation of the KRFWS method that runs at over 1000 fps on a desktop PC equipped with a 4-core CPU. The single-stage DAN tracker runs at 160 fps on an NVIDIA GeForce GTX 1070 GPU (this is the execution time only, excluding the network transfer).
The high framerate of the remote face alignment method, DAN, is possible thanks to an efficient data transmission scheme. Instead of transmitting the entire image captured by the camera, only a small 112×112 pixel image containing the face itself is sent. This particular size comes from the DAN method itself, which uses this image size as input. Because of that the small image size does not have any adverse effect on the face alignment accuracy. Combined with auxiliary data the total bandwidth at 30 fps is only 3.1 Mbit/s.
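The stated bandwidth can be checked with a short calculation. The snippet assumes an uncompressed single-channel 8-bit face crop of 112×112 pixels, which is the input size used by DAN [15]; the exact size of the auxiliary data is not given in the text, so only the image payload is computed.

```python
def image_bandwidth_mbit(width, height, bits_per_pixel, fps):
    """Raw image payload in Mbit/s for uncompressed frames."""
    return width * height * bits_per_pixel * fps / 1e6


# Assumed crop format: 112x112 pixels, 8 bits per pixel, 30 frames per second.
payload = image_bandwidth_mbit(112, 112, 8, 30)
```

Under these assumptions the image payload is about 3.01 Mbit/s, which leaves roughly 0.1 Mbit/s for the auxiliary data within the reported total of 3.1 Mbit/s.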
The low bandwidth requirement potentially allows for using HoloFace with DAN tracking on GSM networks, using a mobile access point or, if future AR headsets allow it, a direct connection to the mobile network. This would allow a similar level of mobility for the user with both the local and remote face tracking methods.
It is important to note that the framerate of HoloFace is independent of the framerate of the main thread of the application, which stays at a constant 60 fps, the maximal value. This ensures a smooth AR experience and leaves room for additional load from other sources.
In terms of experience the two most visible issues of HoloFace are the delay caused by the acquisition time of the camera (more details in section 6.1) and the noise produced by the local face tracker. The impact of both issues is significantly reduced by the filtering methods described in section 4.5. For details on how HoloFace performs, please see the video in supplementary materials.
Figure 5. Several frames from a sequence showing a face augmented with sparks flying out of the ears, the animation is enabled when the subject opens his mouth. Images taken directly through the display of Microsoft HoloLens.
Figure 6. A Basler Scout camera was attached to the HoloLens to capture materials for the Figures and supplementary videos.
6.1. HoloLens webcam delay
The webcam of Microsoft HoloLens, which is the only camera on the device currently available to the developers, has a significant delay between the moment the frame is acquired and the moment it is returned by the API. We have measured the delay using a method similar to the one proposed in [3]. A high speed camera (we used a Basler ace series camera) is attached to the headset so that it sees the display. Subsequently the HoloLens is set to display the most recent image acquired by the camera with no other tasks running. Then the rig consisting of the HoloLens and the camera is pointed at a high speed clock. The difference between the time seen by the camera on the clock and the time shown on the image displayed by the headset (also recorded by the camera) is the time between image acquisition and rendering. In our experiments this time was ms. Together with the processing time, the average delay between acquisition and rendering for the whole pipeline is 170ms.
6.2. Face tracking evaluation
In order to evaluate the two face tracking methods, as well as the novel failure detection method used in DAN tracking, we perform experiments on the 300-VW [19] face tracking dataset. The dataset consists of videos in three categories, where the third contains the most difficult sequences, more details in [19]. In order to compare to the state of the art we use the results of an evaluation of several methods recently published in [24]. Since the authors of [24] only use the 49 internal landmarks we follow the same protocol and reject the 17 landmarks on the face’s edge as well as the two inner mouth corners.
In each video we detect the face using OpenCV and track it using DAN and KRFWS until a loss of tracking is detected with the corresponding method; once that happens, face detection is performed again. The experiments performed in [24] use a different strategy: the face is detected in every 30th frame and tracked in between. There is no tracking failure detection and the face detection method is more powerful [27]. It also has to be mentioned that the methods used in HoloFace are trained on different datasets than the methods described in [24].
Following [24] we use the mean distance between the localized landmarks and the ground truth landmarks divided by the inter-ocular distance (the distance between the outer eye corners) as the error metric. This metric is also known as the normalized point-to-point distance. Based on this metric we plot the Cumulative Error Distribution (CED) curves shown in Figure 7. We also calculate the failure rate, which is the percentage of images with an error greater than 0.08, and the area under the CED curve up to the threshold of 0.08 (AUC$_{0.08}$). The results of the above experiments are shown in Tables 1 and 2.
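The metrics above can be computed as sketched below. The landmark indices of the outer eye corners, the number of integration steps and the normalization of the area by the threshold are assumptions of this example; the 0.08 failure threshold is the one stated in the text.

```python
import numpy as np


def normalized_error(pred, gt, left_outer_eye, right_outer_eye):
    """Mean point-to-point distance divided by the inter-ocular distance.

    pred, gt: (N, 2) arrays of landmark coordinates
    left_outer_eye, right_outer_eye: indices of the outer eye corners in gt
    """
    mean_dist = np.linalg.norm(pred - gt, axis=1).mean()
    inter_ocular = np.linalg.norm(gt[left_outer_eye] - gt[right_outer_eye])
    return mean_dist / inter_ocular


def failure_rate(errors, threshold=0.08):
    """Percentage of images whose error exceeds the threshold."""
    return 100.0 * float(np.mean(np.asarray(errors) > threshold))


def auc(errors, threshold=0.08, steps=1000):
    """Area under the CED curve up to `threshold`, normalized to [0, 1]."""
    errors = np.asarray(errors)
    xs = np.linspace(0.0, threshold, steps)
    ced = np.array([(errors <= x).mean() for x in xs])
    area = np.sum((ced[1:] + ced[:-1]) * np.diff(xs)) / 2.0  # trapezoidal rule
    return float(area / threshold)
```

The CED curve itself is the `ced` array plotted against `xs`; the failure rate and the area summarize it in a single number each.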
Even though only a single stage is used, DAN combined with our failure detection method obtains the lowest failure rate on all the subsets and the best area-under-the-curve (AUC) score on the hardest category. KRFWS is significantly less precise, with low AUC scores; its failure rate, however, is less than 10% for the first two subsets. Even though the results are poorer we believe that the use of this method is justified thanks to its extremely high efficiency (over 1000 fps on a desktop computer as mentioned above). The results of KRFWS would improve if the number of stages or the feature size were increased. This would however lower the framerate on the HoloLens to less than 30 fps, which we tried to avoid at all cost.
Figure 7. Cumulative error distribution curves of face tracking methods on the 300-VW [19] dataset.
Table 1. AUC of the face tracking methods on the 300-VW test set.
Table 2. Failure rate percentage of the face tracking methods on the 300-VW test set.
We have presented HoloFace, an open-source framework for real time face alignment, head pose estimation and facial attribute retrieval for Microsoft HoloLens. With the help of our application Augmented Reality developers and researchers can add a new element of interaction into their projects.
Some of the areas we believe are interesting for future work on HoloFace are the development of a new local face tracking method that would be more accurate and less noisy, and the validation of face tracking using the remote method over a GSM network.
We also hope that future Windows Mixed Reality devices will give the developers access to the low latency cameras used by the headset for tracking. This would significantly improve the experience thanks to the shorter time between image acquisition and rendering.
[1] J. Ahlberg. Candide-3 - an updated parameterised face. Technical report, Dept. of Electrical Engineering, Linköping University, 2001.
[2] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Robust discriminative response map fitting with constrained local models. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 3444–3451, June 2013.
[3] A. H. Bermano, M. Billeter, D. Iwai, and A. Grundhöfer. Makeup lamps: Live augmentation of human faces via projection. Comput. Graph. Forum, 36:311–323, May 2017.
[4] A. Bulat and G. Tzimiropoulos. Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In 2017 IEEE International Conference on Computer Vision, Oct 2017.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Conference on Computer Vision and Pattern Recognition, pages 886–893, June 2005.
[6] M. Dantone, L. Bossard, T. Quack, and L. van Gool. Augmented faces. In 2011 IEEE International Conference on Computer Vision Workshops, pages 24–31, Nov 2011.
[7] N. A. Dodgson. Variation and extrema of human interpupillary distance. Proc. SPIE, 5291:36–46, 2004.
[8] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-pie. Image and Vision Computing, 28(5):807–813, May 2010.
[9] K. Hara and R. Chellappa. Growing regression forests by classification: Applications to object pose estimation. In 13th European Conference on Computer Vision, pages 552–567, 2014.
[10] A. Jourabloo, M. Ye, X. Liu, and L. Ren. Pose-invariant face alignment with a single cnn. In 2017 IEEE International Conference on Computer Vision, Oct 2017.
[11] R. E. Kalman et al. A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45, 1960.
[12] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1867–1874, June 2014.
[13] V. Kitanovski and E. Izquierdo. 3d tracking of facial features for augmented reality applications. In 12th International Workshop on Image Analysis for Multimedia Interactive Services, 2011.
[14] M. Kowalski and J. Naruniec. Face alignment using k-cluster regression forests with weighted splitting. IEEE Signal Processing Letters, 23(11):1567–1571, Nov 2016.
[15] M. Kowalski, J. Naruniec, and T. Trzcinski. Deep alignment network: A convolutional neural network for robust face alignment. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, July 2017.
[16] Microsoft. HoloLensForCV, 2017.
[17] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 fps via regressing local binary features. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1685–1692, June 2014.
[18] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 397–403, Dec 2013.
[19] J. Shen, S. Zafeiriou, G. G. Chrysos, J. Kossaifi, G. Tzimiropoulos, and M. Pantic. The first facial landmark tracking in-the-wild challenge: Benchmark and results. In 2015 IEEE International Conference on Computer Vision Workshop, pages 1003–1011, Dec 2015.
[20] Z. Wang and X. Yang. V-Head: Face Detection and Alignment for Facial Augmented Reality Applications, pages 450–454. Springer International Publishing, Cham, 2017.
[21] X. Xiong and F. D. la Torre. Supervised descent method and its applications to face alignment. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 532–539, June 2013.
[22] J. Yang, Q. Liu, and K. Zhang. Stacked hourglass network for robust facial landmark localisation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2025–2033, July 2017.
[23] S. Yang, P. Luo, C. C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. CoRR, abs/1511.06523, 2015.
[24] A. Zadeh, T. Baltrusaitis, and L. Morency. Deep constrained local models for facial landmark detection. CoRR, abs/1611.08657, 2016.
[25] S. Zafeiriou, G. Trigeorgis, G. Chrysos, J. Deng, and J. Shen. The menpo facial landmark localisation challenge: A step towards the solution. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[26] J. Zhang, S. Shan, M. Kan, and X. Chen. Coarse-to-Fine Auto-Encoder Networks (CFAN) for Real-Time Face Alignment, pages 1–16. Springer International Publishing, Cham, 2014.
[27] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, Oct 2016.
[28] S. Zhu, C. Li, C. C. Loy, and X. Tang. Face alignment by coarse-to-fine shape searching. In 2015 IEEE Conference on Computer Vision and Pattern Recognition, pages 4998–5006, June 2015.