We propose a method for the automatic calibration of traffic surveillance cameras with wide-angle lenses. A few minutes of video footage is sufficient for the entire calibration process. The method takes the height of the camera above the ground plane as its only user input, which is needed to resolve the scale ambiguity. The calibration is performed in two stages: (1) intrinsic calibration and (2) extrinsic calibration. Intrinsic calibration is achieved by assuming an equidistant fish-eye distortion and an ideal camera model. Extrinsic calibration is accomplished by estimating two vanishing points on the ground plane from the motion of vehicles at perpendicular intersections. The intrinsic calibration stage is also valid for thermal cameras. Experiments demonstrate the effectiveness of this approach on visible as well as thermal cameras.
Index Terms—fish-eye, calibration, thermal camera, intelligent transportation systems, vanishing points
Camera calibration is of immense importance in the extraction of information from video surveillance data. It can be used either to compensate for the perspective distortion of objects in the image plane or to make photogrammetric measurements such as distances, velocities and trajectories. It is also fundamental to multiview 3D reconstruction. Besides, with the aid of 3D information, it can be used for vehicle tracking or object detection that is robust to occlusion.
Owing to its importance, a significant portion of the computer vision literature addresses the problem of camera calibration. Almost all calibration methods can be categorized into two major approaches:
• Correspondence-based methods
• Vanishing point-based methods
The first approach tries to exploit the properties of the 3D scene structure to find correspondences between the real world and its 2D image captured by the camera. With the aid of these correspondences, the intrinsic as well as the extrinsic parameters can be estimated, and the accuracy improves as the number of correspondences increases. The vanishing point methods are mainly based on estimating the orthogonal vanishing points in the scene and require no a priori knowledge for recovering the extrinsic and intrinsic matrices.
If the camera has already been calibrated using 3D rigs or a checkerboard, this data can be used directly in the second stage. However, in most cases this data is not available, and hence there is also a need for automatic intrinsic calibration. Most of the literature in traffic surveillance considers an ideal camera model, which assumes that the pixels are perfectly square with zero skew and that the optical center of the camera coincides exactly with the image center. The last assumption is not necessarily true, and in such cases the optical center is calculated in a slightly different fashion.
For extrinsic calibration in the context of traffic surveillance, the correspondence approach may require annotation of lane markings together with the lane width, the presence of ground control points, or regional heuristics such as average vehicle dimensions or speed. However, the majority of these methods involve human intervention or are location-dependent, and are therefore unsuitable for generalized auto-calibration. For such applications, we make use of vanishing points. The vanishing points may be generated from static scene structures, lane markings, or the motion of vehicles and pedestrians. For robust automatic camera calibration, it is advisable that the calibration process be independent of the scene, and hence our approach uses only the moving objects in the scene. This paper deals specifically with wide-angle lens cameras, which are ideal for most photogrammetry applications. However, they are always accompanied by various distortion effects, among which fish-eye effects dominate. It is essential to remove such effects before vanishing point estimation. The remainder of the paper is organized as follows: Section II defines the camera model, Sections III and IV describe the intrinsic and extrinsic calibration processes, Section V presents the results of our approach, and Section VI presents the conclusion and future scope of the algorithm.
We assume that our camera obeys the pin-hole camera model. In this model, the perspective projection of a 3D point onto a 2D image point can be represented as follows:
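For reference, the standard pin-hole projection that matches this description can be written as below. The symbols $\lambda$ (projective scale), $(u, v)$ (pixel coordinates), $(X, Y, Z)$ (world coordinates), $r_{ij}$ and $t_i$ are assumed notation; the entries of the first matrix follow the definitions given in the text.

```latex
\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix}
r_{11} & r_{12} & r_{13} & t_1 \\
r_{21} & r_{22} & r_{23} & t_2 \\
r_{31} & r_{32} & r_{33} & t_3
\end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
\tag{1}
```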
Fig. 1: (a) Original distorted image of the traffic Scene, (b) Extracting trajectories using KLT-tracker, (c) Filtering top 10 trajectories, (d) Undistorted image of the same scene
The first matrix on the right-hand side is referred to as the intrinsic matrix, as it depends only on the internal properties of the camera. $f_x$ and $f_y$ represent the focal lengths in the x and y directions, $s$ represents the skew of the pixels, and $(c_x, c_y)$ represents the optical center. In our approach we assume that the focal lengths in the x and y directions are identical, that the skew is zero, and that the optical center coincides with the image center. With these simplifications, the equation for a camera that captures an image of size $H \times W$ becomes:
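Under these assumptions the intrinsic matrix has a single unknown, the focal length $f$, and the optical center sits at $(W/2, H/2)$. The following is a minimal Python sketch of the simplified projection; the function names `intrinsic_matrix` and `project` are illustrative and are not taken from the paper.

```python
def intrinsic_matrix(f, width, height):
    """Ideal intrinsic matrix: equal focal lengths, zero skew, center at the image center."""
    return [[f, 0.0, width / 2.0],
            [0.0, f, height / 2.0],
            [0.0, 0.0, 1.0]]


def project(K, R, t, point):
    """Project a 3D world point to pixel coordinates with K [R | t]."""
    # Transform the point into the camera frame: X_cam = R X + t.
    cam = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # Apply the intrinsic matrix, then divide by the homogeneous coordinate.
    hom = [sum(K[i][j] * cam[j] for j in range(3)) for i in range(3)]
    return hom[0] / hom[2], hom[1] / hom[2]
```

For a 640 x 480 image, `intrinsic_matrix(500.0, 640, 480)` places the optical center at (320, 240), so a point on the optical axis always projects to the image center.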
The second matrix in equation 1 is the extrinsic matrix, composed of a $3 \times 3$ rotation matrix augmented with a $3 \times 1$ translation vector. Without any user input, our approach can estimate the rotation matrix and the translation vector only up to scale. However, if the height of the camera above the ground is also provided, the scale ambiguity can be resolved. To account for the deviation of the camera from the pin-hole model, we consider a distortion model separately. Here we assume only radial effects, fitting a polynomial model as:
Here, $(x_u, y_u)$ and $(x_d, y_d)$ denote the undistorted and distorted normalized image coordinates, respectively, and $[k_1, k_2, k_3]$ are known as the distortion coefficients. $r$ is defined as follows:
In this model, the distortion center coincides with the image center. Also, the fish-eye effect of distortion is assumed to be radial in nature.
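A commonly used form of such a radial polynomial model is sketched below. The direction of the mapping (undistorted to distorted) and the use of even powers only are assumptions based on the widely used convention, not details confirmed by the text; $(x_u, y_u)$ and $(x_d, y_d)$ are assumed names for the undistorted and distorted normalized coordinates.

```latex
\begin{aligned}
x_d &= x_u \left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right), \\
y_d &= y_u \left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right), \\
r^2 &= x_u^2 + y_u^2 .
\end{aligned}
```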
A. Fish-eye Effect
A pin-hole type world-to-image mapping is possible only with a rectilinear lens, which satisfies:
$$ r_u = f \tan\theta \tag{6} $$
Here $\theta$ is the angle in radians between a point in the real world and the optical axis, which runs from the center of the image through the center of the lens, $f$ is the focal length of the lens, and $r_u$ is the radial position of the point on the image film or sensor. Since this mapping abides by the pin-hole model, $r_u$ is considered undistorted.
After examining the available datasets, we found that a global distortion model was needed to account for the distortion. The simplest and most common such model is the equidistant fish-eye model, defined as follows:
$$ r_d = f\,\theta \tag{7} $$
Here $r_d$ is the radial position of the point on the image film or sensor. Since this mapping does not abide by the pin-hole model, $r_d$ is considered distorted. Equations (6) and (7) can be used to compute the forward mapping, i.e. from distorted coordinates to undistorted coordinates, and the inverse mapping, i.e. from undistorted coordinates to distorted coordinates, respectively as follows:
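Eliminating the angle between the rectilinear relation $r_u = f\tan\theta$ and the equidistant relation $r_d = f\theta$ gives $r_u = f\tan(r_d/f)$ and $r_d = f\arctan(r_u/f)$. A minimal Python sketch of these two mappings follows; the function names are illustrative.

```python
import math


def undistort_radius(r_d, f):
    """Forward mapping: distorted (equidistant) radius -> undistorted radius."""
    theta = r_d / f  # incidence angle implied by r_d = f * theta
    if theta >= math.pi / 2:
        raise ValueError("point lies outside the field of view of a pin-hole camera")
    return f * math.tan(theta)


def distort_radius(r_u, f):
    """Inverse mapping: undistorted (rectilinear) radius -> distorted radius."""
    return f * math.atan(r_u / f)
```

The two functions are inverses of each other, and the undistorted radius is always at least as large as the distorted one, which is why straightened fish-eye images appear stretched towards the borders.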
Thus, given enough correspondences between the undistorted and distorted coordinates, it is possible to estimate the focal length as well as the distortion coefficients using a least-squares approach.
To make the calibration independent of the scene, only moving objects are used. This is because it is not always possible to find a similar geometric arrangement of static structures in every scene on which calibration is performed. Moving objects, however, comprise pedestrians and vehicles, at least one of which is present in almost every traffic scene.
There is only one unknown in the intrinsic matrix and in equations 8 and 9, namely the focal length $f$. Thus, if we can determine $f$, the intrinsic matrix and the distortion coefficients can be computed. In general, vehicles on a road are assumed to move along straight lines. However, due to distortion (the fish-eye effect), these straight lines become curved in the image. The relation between the undistorted image and the distorted image depends only on the focal length. Thus, the focal length can be computed by adjusting the value of $f$ until the curved trajectories become straight. The calibration process starts with the extraction of moving-object trajectories from the video footage, as shown in Fig. 1(b). It is assumed that the video contains a sufficient number of vehicles moving in orthogonal directions. To extract the distorted straight lines, moving pedestrians or vehicles are initially tracked in the video footage for a few seconds. Tracking is achieved by optical flow on SIFT keypoints. Multiple tracks are extracted and filtered as follows:
• If a keypoint is tracked for no more than 80% of the video interval, it is rejected.
• If the total distance in pixels travelled by a keypoint is greater than 1.2 times its displacement, the keypoint is rejected.
Fig. 2: Bimodal distribution of gradients of line segments
Once the trajectories are filtered, the ten longest trajectories are selected. These trajectories are fitted with straight lines using the least-squares method. The sum of the least-squares errors over all the trajectories gives an estimate of the straightness of the lines. The distorted points on the trajectories are then undistorted while varying the focal length from zero to the length of the image diagonal, i.e. the maximum possible focal length in pixels. The most appropriate focal length is the one with the minimum least-squares error. Once the focal length is found, the intrinsic matrix is computed.
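The trajectory filtering and the focal-length search can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: trajectories are lists of (x, y) pixel positions, "longest" is taken to mean largest displacement, the line fit uses perpendicular (total least-squares) residuals so that vertical tracks are handled, and all function names are hypothetical.

```python
import math


def keep_trajectory(points, n_frames, min_tracked=0.8, max_ratio=1.2):
    """Apply the two filters: tracked long enough and close to straight."""
    if len(points) < 2 or len(points) <= min_tracked * n_frames:
        return False
    path = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    displacement = math.dist(points[0], points[-1])
    return displacement > 0 and path <= max_ratio * displacement


def select_longest(trajectories, n_frames, top=10):
    """Keep the filtered trajectories with the largest displacement."""
    kept = [t for t in trajectories if keep_trajectory(t, n_frames)]
    kept.sort(key=lambda t: math.dist(t[0], t[-1]), reverse=True)
    return kept[:top]


def undistort_point(x, y, f, cx, cy):
    """Equidistant fish-eye -> rectilinear, about the center (cx, cy)."""
    dx, dy = x - cx, y - cy
    r_d = math.hypot(dx, dy)
    if r_d == 0.0:
        return x, y
    scale = f * math.tan(r_d / f) / r_d
    return cx + dx * scale, cy + dy * scale


def straightness_error(points):
    """Sum of squared perpendicular distances to the best-fit line."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    # The smaller eigenvalue of the 2x2 scatter matrix is the residual.
    det = sxx * syy - sxy ** 2
    half_trace = (sxx + syy) / 2.0
    largest = half_trace + math.sqrt(max(half_trace ** 2 - det, 0.0))
    return max(det / largest, 0.0) if largest > 0.0 else 0.0


def estimate_focal_length(trajectories, cx, cy, candidates):
    """Return the candidate focal length that makes the tracks straightest."""
    r_max = max(math.hypot(x - cx, y - cy) for t in trajectories for x, y in t)
    best_f, best_err = None, math.inf
    for f in candidates:
        if f <= 0 or r_max / f >= math.pi / 2:
            continue  # tan() is not usable at or beyond 90 degrees
        err = sum(
            straightness_error([undistort_point(x, y, f, cx, cy) for x, y in t])
            for t in trajectories
        )
        if err < best_err:
            best_f, best_err = f, err
    return best_f
```

A call such as `estimate_focal_length(select_longest(tracks, n_frames), W / 2, H / 2, range(1, diagonal))` mirrors the sweep described above. Because undistortion magnifies the image more strongly for small focal lengths, a practical implementation may need to normalize the error by trajectory length; that refinement is omitted here.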
However, to estimate the distortion coefficients, the SIFT keypoints must first be normalized with the intrinsic matrix as:
The undistorted coordinates of the same keypoints are calculated with the mapping in equation 8. The coordinates of the distorted and undistorted keypoints can then be used to compute the distortion coefficients using equation 11.
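One way to carry out this estimation is sketched below in plain Python. It assumes the common convention $x_d = x_u(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$ with $r^2 = x_u^2 + y_u^2$, which makes the problem linear in $[k_1, k_2, k_3]$. This convention, the normal-equation solver and the function names are assumptions of the sketch rather than details given in the paper.

```python
def normalize(u, v, f, cx, cy):
    """Pixel -> normalized coordinates, i.e. K^-1 [u, v, 1]^T for the ideal K."""
    return (u - cx) / f, (v - cy) / f


def solve_3x3(A, b):
    """Gaussian elimination with partial pivoting for a 3x3 system."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * 3
    for i in (2, 1, 0):
        x[i] = (M[i][3] - sum(M[i][j] * x[j] for j in range(i + 1, 3))) / M[i][i]
    return x


def fit_distortion(undistorted, distorted):
    """Least-squares [k1, k2, k3] from matched normalized points."""
    AtA = [[0.0] * 3 for _ in range(3)]
    Atb = [0.0] * 3
    for (xu, yu), (xd, yd) in zip(undistorted, distorted):
        r2 = xu * xu + yu * yu
        powers = (r2, r2 ** 2, r2 ** 3)
        # One linear equation per coordinate: c_u * (k . powers) = c_d - c_u.
        for c_u, c_d in ((xu, xd), (yu, yd)):
            row = [c_u * p for p in powers]
            for i in range(3):
                Atb[i] += row[i] * (c_d - c_u)
                for j in range(3):
                    AtA[i][j] += row[i] * row[j]
    return solve_3x3(AtA, Atb)
```

The points passed to `fit_distortion` should be spread over the whole image; if they all lie near the center, the higher-order coefficients are poorly constrained.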
The steps of intrinsic calibration are shown in Fig. 1.
A. Vanishing Point Method
Given the intrinsics, the extrinsics are computed using vanishing points. Vanishing points are defined as the points on the image plane where the images of lines that are parallel in 3D intersect. For calibration, a minimum of two orthogonal vanishing points is needed. Due to undistortion, the focal length of the new image differs from that of the original image, so the new focal length must be computed first. Given two orthogonal vanishing points, the new focal length is computed using the relation:
Once the new focal length is known, the elements of the rotation matrix are computed from equations 14, 15 and 16:
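A minimal Python sketch of this step is given below, using the standard relations for two orthogonal vanishing points under the ideal camera model; the notation may differ from the paper's equations 13 to 16. With the optical center $(c_x, c_y)$ and vanishing points $(u_1, v_1)$ and $(u_2, v_2)$ (assumed symbols), orthogonality gives $f^2 = -[(u_1 - c_x)(u_2 - c_x) + (v_1 - c_y)(v_2 - c_y)]$. The first two rotation columns are the normalized back-projections of the vanishing points, and the third is their cross product. Function names are illustrative.

```python
import math


def focal_from_vanishing_points(vp1, vp2, cx, cy):
    """Focal length from the vanishing points of two orthogonal directions."""
    f_squared = -((vp1[0] - cx) * (vp2[0] - cx) + (vp1[1] - cy) * (vp2[1] - cy))
    if f_squared <= 0:
        raise ValueError("vanishing points are not consistent with orthogonal directions")
    return math.sqrt(f_squared)


def _unit(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def rotation_from_vanishing_points(vp1, vp2, f, cx, cy):
    """Rotation matrix whose columns are the world X, Y, Z axes in the camera frame."""
    # Back-project each vanishing point: direction = K^-1 [u, v, 1]^T.
    # Each axis is only fixed up to sign; the positive-depth direction is chosen here.
    r1 = _unit([(vp1[0] - cx) / f, (vp1[1] - cy) / f, 1.0])
    r2 = _unit([(vp2[0] - cx) / f, (vp2[1] - cy) / f, 1.0])
    # The third axis completes the right-handed frame.
    r3 = [r1[1] * r2[2] - r1[2] * r2[1],
          r1[2] * r2[0] - r1[0] * r2[2],
          r1[0] * r2[1] - r1[1] * r2[0]]
    return [[r1[i], r2[i], r3[i]] for i in range(3)]
```

With noisy vanishing points the two back-projected directions are not exactly perpendicular, so the resulting matrix is only approximately a rotation and benefits from a subsequent orthonormalization step.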
Fig. 3: (a) Vanishing Point in x-direction, (b) Vanishing Point in y-direction
Assuming the road surface to be planar and to constitute the Z = 0 plane, equation 2 is transformed to:
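Setting $Z = 0$ removes the third column of the rotation matrix from the projection, leaving a $3 \times 3$ homography between the road plane and the image. In assumed notation ($\lambda$ for the projective scale, $\mathbf{r}_1$ and $\mathbf{r}_2$ for the first two columns of the rotation matrix, $\mathbf{t}$ for the translation vector, $f$ for the new focal length), this standard result reads:

```latex
\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix} f & 0 & W/2 \\ 0 & f & H/2 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \mathbf{r}_1 & \mathbf{r}_2 & \mathbf{t} \end{bmatrix}
\begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}
```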
If we assume that the origin of the world coordinate system corresponds to the center of the image and that the height of the camera above the ground is $h$, then:
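One way to obtain the translation under these two assumptions is sketched below; it is a derivation from the stated assumptions and may differ in form from the paper's equation 18. The symbols $\mathbf{K}$ (intrinsic matrix), $\mathbf{R}$, $\mathbf{t}$, $\mathbf{C}$ (camera center in world coordinates), $t_z$, $r_{33}$ and $h$ are assumed notation. Since the world origin projects to $\mathbf{K}\mathbf{t}$, requiring this to be the image center forces $\mathbf{t}$ onto the optical axis, and the camera height then fixes its length:

```latex
\begin{aligned}
\mathbf{K}\,\mathbf{t} \propto \begin{bmatrix} W/2 \\ H/2 \\ 1 \end{bmatrix}
&\;\Rightarrow\; \mathbf{t} = \begin{bmatrix} 0 \\ 0 \\ t_z \end{bmatrix}, \\
\mathbf{C} = -\mathbf{R}^{\top}\mathbf{t}
&\;\Rightarrow\; h = |C_Z| = |r_{33}|\, t_z
\;\Rightarrow\; t_z = \frac{h}{|r_{33}|} .
\end{aligned}
```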
As mentioned above, extrinsic calibration is also performed using moving objects only. The motion of pedestrians is highly erratic; thus, only the motion of vehicles is used for vanishing point estimation. In most cases, vehicles follow each other along a straight line. If there are multiple keypoints on a single vehicle, their tracks appear to originate from or converge to a point in the image plane as the vehicle moves closer to or farther away from the camera. If there are multiple vehicles moving along two orthogonal directions, the orthogonal vanishing points can easily be detected.
The algorithm described below is computationally expensive and requires significant movement of the vehicles in pixels; it is therefore performed on one out of every six frames. The first step is to undistort the image with the help of the distortion coefficients, as it is not easy to detect vanishing points from curves. This is followed by YOLO-v3 based detection of vehicles, which generates the regions that contain vehicles with high probability. A mask of such regions is computed for every processed image. SIFT keypoints are generated in the masked images and matched with the keypoints in the consecutive masked images. The positions of the matched keypoints in any two frames produce multiple line segments. All the line segments are stored for further processing.
Once the whole short video footage is processed, a list of line segments is obtained. Every line segment is converted into polar form, i.e. (orientation, magnitude, distance from origin). A histogram of orientations is computed from all the segments, as shown in Fig. 2. The histogram shows a bimodal distribution, which implies that there are two major directions in which vehicles move. This bimodal distribution helps in generating clusters of line segments in each orthogonal direction. The most appropriate line segments are selected using the distribution peaks with a threshold of 5 degrees on either side.
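The polar conversion and the peak-based selection can be sketched as follows. The 1-degree histogram bins, the minimum separation of 30 degrees between the two peaks and the function names are assumptions made for this illustration; only the 5-degree tolerance comes from the text.

```python
import math


def to_polar(segment):
    """Return (orientation in degrees, length, distance of the line from the origin)."""
    (x1, y1), (x2, y2) = segment
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    # Orientation is folded so that opposite travel directions share one value.
    orientation = math.degrees(math.atan2(dy, dx)) % 180.0
    distance = abs(x1 * y2 - x2 * y1) / length
    return orientation, length, distance


def angular_gap(a, b):
    """Smallest difference between two orientations, treating 0 and 180 degrees as equal."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)


def cluster_by_orientation(segments, tolerance=5.0, min_separation=30.0):
    """Split segments into the two dominant orientation clusters."""
    segs = [s for s in segments if s[0] != s[1]]  # drop zero-length segments
    polar = [to_polar(s) for s in segs]
    hist = [0] * 180
    for orientation, _, _ in polar:
        hist[int(orientation) % 180] += 1
    # First peak: the fullest bin. Second peak: the fullest bin far enough away.
    first = max(range(180), key=hist.__getitem__)
    second = max((b for b in range(180) if angular_gap(b, first) >= min_separation),
                 key=hist.__getitem__)
    clusters = []
    for peak in (first + 0.5, second + 0.5):  # use the bin centers as peak angles
        clusters.append([s for s, p in zip(segs, polar)
                         if angular_gap(p[0], peak) <= tolerance])
    return clusters
```

Segments whose orientation is more than 5 degrees from both peaks, such as those produced by turning vehicles or mismatched keypoints, end up in neither cluster.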
The vanishing points are detected by a voting-based system. For each cluster, every line segment is extended to a line. Each accumulator cell is one pixel in size and is initially set to zero. Every pixel through which a line passes is incremented by one. The pixels with the maximum votes are thus considered the most likely positions of the vanishing points. The votes in the two directions are shown in Fig. 3.
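A minimal sketch of such a per-pixel voting scheme is shown below. It assumes that the accumulator covers a window of `width` x `height` pixels; since vanishing points often lie outside the image, a real implementation would need a window larger than the image, with a suitable offset. The function names are illustrative.

```python
def vote_lines(segments, width, height):
    """Accumulate one vote per pixel crossed by each segment's infinite extension."""
    acc = [[0] * width for _ in range(height)]
    for (x1, y1), (x2, y2) in segments:
        dx, dy = x2 - x1, y2 - y1
        if dx == 0 and dy == 0:
            continue  # a zero-length segment has no direction
        if abs(dx) >= abs(dy):
            # Shallow line: visit every column once.
            for x in range(width):
                y = round(y1 + dy * (x - x1) / dx)
                if 0 <= y < height:
                    acc[y][x] += 1
        else:
            # Steep line: visit every row once.
            for y in range(height):
                x = round(x1 + dx * (y - y1) / dy)
                if 0 <= x < width:
                    acc[y][x] += 1
    return acc


def best_pixel(acc):
    """Return (x, y, votes) of the pixel with the most votes."""
    votes, x, y = max((v, x, y) for y, row in enumerate(acc) for x, v in enumerate(row))
    return x, y, votes
```

Stepping along the dominant axis of each line guarantees that a line casts exactly one vote per column (or row), so long lines do not outweigh short ones.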
However, the point with the maximum votes is not necessarily the vanishing point, and there can be more than one candidate vanishing point because the undistortion is not perfect. If more than one candidate exists, the point with the maximum votes may lie somewhere between them. To account for this, we take the top 20% of the pixels with the maximum votes and compute their mean and standard deviation. The estimate of the vanishing point is set equal to the sum of the mean and two or three times the standard deviation. The columns of the rotation matrix do not vary much with the value of the constant multiplying the standard deviation.
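The estimate described above can be sketched as follows. The text does not state exactly which quantities the mean and standard deviation are taken over; this sketch assumes they are computed over the x and y positions of the top 20% of the pixels that received at least one vote, and uses the population standard deviation. The function name and parameters are illustrative.

```python
import statistics


def estimate_vanishing_point(accumulator, fraction=0.2, k=2.0):
    """Mean plus k standard deviations of the positions of the top-voted pixels."""
    voted = [(votes, x, y)
             for y, row in enumerate(accumulator)
             for x, votes in enumerate(row) if votes > 0]
    if not voted:
        raise ValueError("accumulator contains no votes")
    # Sort by vote count (highest first) and keep the requested fraction.
    voted.sort(reverse=True)
    top = voted[:max(1, round(fraction * len(voted)))]
    xs = [x for _, x, _ in top]
    ys = [y for _, _, y in top]
    return (statistics.fmean(xs) + k * statistics.pstdev(xs),
            statistics.fmean(ys) + k * statistics.pstdev(ys))
```

Here `k` plays the role of the constant (two or three) that multiplies the standard deviation.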
Once the vanishing points are computed, equations 13, 14, 15, 16 and 18 can be used for extrinsic calibration. To improve the rotation matrix and enforce the orthonormality constraints on it, singular value decomposition (SVD) is performed on the matrix, $R = U \Sigma V^{\top}$. The new matrix is defined as:
$$ R' = U S V^{\top} $$
where $S$ is an identity matrix. With the height of the camera above the ground, the translation vector is computed from equation 18.
Fig. 4: Top-view transformed Image
TABLE I: Comparison of focal length with the proposed method and checkerboard method
A. Accuracy of Estimated Focal Length
To assess the accuracy of the focal length computed by the proposed algorithm, the camera is also calibrated using a checkerboard pattern. With checkerboard calibration, the focal lengths in the x and y directions may not be the same; their geometric mean is therefore used as the reference estimate.
In Table I, the Proposed column gives the focal length estimated by the proposed algorithm and the Checkerboard column gives the value obtained from checkerboard calibration. Results from 5 very different video sequences are used to account for the generality of the results. The error is around 6.35% of the true value (assuming the checkerboard calibration to be the ground truth).
B. Accuracy of Estimated Distortion Coefficients
To assess the accuracy of the distortion coefficients, the mean least-squares error of the trajectories is computed with the predicted distortion coefficients and with the checkerboard-calibrated coefficients. The results are tabulated in Table II.
In Table II, the second column gives the mean trajectory distortion error in the original image. The third column gives the mean distortion error in the image rectified with the proposed algorithm. The last column gives the mean distortion error in the image rectified with the checkerboard calibration coefficients. It can be seen that, in most cases, the proposed method performs better undistortion than the checkerboard calibration coefficients.
TABLE II: Comparison of undistortion of trajectories before and after the Intrinsic Calibration with the proposed method and Checkerboard Method
C. Accuracy of Rotation Matrices
It was not possible to assess the accuracy of the rotation matrices quantitatively because the rotation data was not recorded when the camera was originally set up; only the video footage was accessible. However, it is possible to estimate the accuracy visually. This is done by transforming the image of a scene to a top-view image, as shown in Fig. 4. In such images, crosswalks appear rectangular rather than trapezium-like, manholes appear circular rather than elliptical, and road markings appear parallel instead of intersecting. There are enough visual cues in such images to test the accuracy of the calibration visually.
The proposed algorithm can perform calibration automatically from video footage. The intrinsic calibration also works well for thermal cameras, since the calibration procedure depends only on moving objects. However, there are certain limitations associated with the proposed method:
• Tracking done using KLT tracker is not robust against occlusion.
• The assumption that vehicles move along a straight line may be violated in certain scenarios.
• There should be a sufficient number of vehicles moving in both directions for the vanishing point estimates to be robust to noise.
Once the calibration is performed, it can be used in multiple applications such as photogrammetry, speed monitoring and 3D reconstruction. In the future, this algorithm could also be made independent of the requirement of perpendicular intersections.
 O. Faugeras, Three dimensional computer vision: A geometric viewpoint. MIT Press, 1993.
 R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. New York, NY, USA: Cambridge University Press, 2003.
 R. Tsai, “An efficient and accurate camera calibration technique for 3d machine vision,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 1986.
 Z. Zhang, “A flexible new technique for camera calibration,” in Proceedings of 7th International Conference on Computer Vision, 1999.
 B. Caprile and V. Torre, “Using vanishing points for camera calibration,” International Journal of Computer Vision, vol. 4, no. 2, pp. 127–139, Mar 1990. [Online]. Available: https://doi.org/10.1007/BF00127813
 J. Deutscher, M. Isard, and J. MacCormick, “Automatic camera calibration from a single manhattan image,” in Computer Vision — ECCV 2002, A. Heyden, G. Sparr, M. Nielsen, and P. Johansen, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002, pp. 175–188.
 D. Liebowitz, A. Criminisi, and A. Zisserman, “Creating architectural models from images,” in Proceedings of EuroGraphics, 1999.
 Fengjun Lv, Tao Zhao, and R. Nevatia, “Camera calibration from video of a walking human,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1513–1518, Sep. 2006.
 Zhaoxiang Zhang, Min Li, Kaiqi Huang, and Tieniu Tan, “Practical camera auto-calibration based on object appearance and motion for traffic scene visual surveillance,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition, June 2008, pp. 1–8.
 K. Ismail, T. Sayed, and N. Saunier, “Camera calibration for urban traffic scenes: Practical issues and a robust approach,” Jan. 2010.
 K. Ismail, T. Sayed, and N. Saunier, “A methodology for precise camera calibration for data collection applications in urban traffic scenes,” Canadian Journal of Civil Engineering, vol. 40, no. 1, pp. 57–67, 2013. [Online]. Available: https://doi.org/10.1139/cjce-2011-0456
 R. Bhardwaj, G. K. Tummala, G. Ramalingam, R. Ramjee, and P. Sinha, “Autocalib: Automatic traffic camera calibration at scale,” in ACM BuildSys 2017. ACM, November 2017, [Best Paper Award Winner and Best Demo Award Winner]. [Online]. Available: https://www.microsoft.com/en-us/research/publication/autocalib-automatic-tra%ef%ac%80ic-camera-calibration-scale/
 I. Junejo and H. Foroosh, “Robust auto-calibration from pedestrians,” in 2006 IEEE International Conference on Video and Signal Based Surveillance, Nov 2006, pp. 92–92.
 J. I. Ronda and A. Valdés, “Geometrical analysis of polynomial lens distortion models,” Journal of Mathematical Imaging and Vision, vol. 61, no. 3, pp. 252–268, Mar 2019. [Online]. Available: https://doi.org/10.1007/s10851-018-0833-x
 D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004. [Online]. Available: https://doi.org/10.1023/B:VISI.0000029664.99615.94