Detecting and classifying damage in sewer pipes is an important application area for computer vision algorithms. This paper describes a system that accomplishes this task based solely on low-quality, severely compressed fisheye images from a pipe inspection robot. Relying on robust image features, we estimate camera poses, model the image lighting, and exploit this information to generate high-quality cylindrical unwraps of the pipes’ surfaces. On the generated images, we apply semantic labeling with deep convolutional neural networks to detect and classify defects as well as structural elements.
This work is funded by the German Federal Ministry of Education and Research under grant number 13N13891.
The inspection of sewer pipes is a crucial task to ensure the functionality of sewage systems. Many sewer pipes in big cities are several decades old, and some are more than one hundred years old. Therefore, regular risk assessment and sanitation planning are needed to ensure the correct functioning of the sewer system. At present, mobile robot systems equipped with cameras or other sensors are used to traverse the pipes under manual control. They produce large amounts of data in which defects have to be annotated manually by technical staff specially trained for this task. As a consequence, the obtained results are error-prone due to the repetitive and tiresome work. To assist workers, reliable computer systems are needed that automatically detect certain defects in sewer pipes. Such a system would typically consist of two modules: first, a preprocessing module that converts raw input images into a form that can be processed automatically, and second, a detection/classification module that performs automated annotation of the provided data.
One early work on the estimation of camera poses in sewer pipes was proposed by Cooper et al. . The authors exploited the longitudinal mortar lines for camera pose recovery, limiting the system to stonewalled pipes. In , the profile of sewer pipes is reconstructed solely from fisheye video sequences. The approach is based on tracking feature points over more than three views, which is not feasible in our application due to the distance of 5 cm, and thus the large changes, between consecutive images. Furthermore, the system was only tested on concrete pipes, which have a relatively well-structured texture for feature detection and matching. Esquivel et al. [2, 3] proposed a system for the reconstruction of sewer shafts that exploits the fact that the camera always faces downward due to the force of gravity. Therefore, the system is restricted to vertical pipes.
With our work, we present a system that assists the employee with the automatic detection and classification of damage in sewer pipes. We solely use unrolled and stitched images (like the one in Figure 6) as input for the detection and classification algorithm. To obtain an unrolled fisheye image, the 3D motion of the camera is tracked and a cylindrical image is generated through back-projection onto an ideal pipe. This image can then easily be cut open and unwound into a planar image.
© 2018 IEEE. Published in the IEEE 2018 Winter Conference on Applications of Computer Vision (WACV 2018), scheduled for 12-14 March 2018 in Lake Tahoe, NV/CA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE.
Figure 1: Sample fisheye image taken with a mobile robot inside the pipe. The center of the round image area on the sensor is given.
The second part of this work is aimed at the automatic detection and classification of defects and structural elements in the pipe. We treat this as a semantic labeling problem, which is tackled using deep convolutional neural networks. To our knowledge, previous methods for this task mostly relied on image processing algorithms and heuristics to detect various types of defects and structural elements. In  for example, the authors use edge detection and morphological operations on CCTV images for crack and open joint detection. The authors of  propose a novel edge detection algorithm for thin crack detection that can overcome some difficulties encountered in noisy environments like sewer pipes.
Besides those algorithms, a few works also use machine learning to implement diagnostic systems based on a range of image processing techniques. Yang and Su  for example use SVMs and simple neural networks on wavelet transforms and co-occurrence matrices to detect open joints, cracks and broken pipes. Another example is presented by Wu et al. in , where ensemble methods on contourlet transforms and statistical features are used to detect cracks, roots and collapsed pipes.
The images of the sewer pipe are taken by a mobile robot with a fisheye camera that has a field of view of 185 degrees. Due to the lens characteristics and the viewing direction along the cylindrical pipe, the resolution of the depicted pipe surface decreases dramatically with increasing distance to the camera. In consequence, only the outer part of the circular image is used to produce the stitched unroll of the sewer pipe. To guarantee sufficient overlap for registration, images are taken with a spacing of five centimeters.
The original images we obtain from the commercial robot system show strong artifacts from severe lossy compression and some image areas are overexposed because of the strong flashlights. In conjunction with the lack of texture information, classical approaches used by systems like Bundler  or VisualSfM [13, 12] fail to track our robot camera. We therefore simplify the problem by assuming a cylindrical shape with a known diameter (for the absolute scale). Instead of having one unknown depth parameter for each image feature, the 3d position of all features can now be related by one unknown rigid body transform with 6 unknowns for the entire frame. The resulting equation system and its solution are explained in the following sections.
The imaging characteristics of the circular fisheye lenses can be described by
with representing the angle of incidence, d the distance of the resulting image point from the image center in pixels, the field of view of the fisheye lens and D the diameter of the projected circular part of the image in pixels. For the unwrapping, we regularly sample the cylindrical surface of the inner pipe at N 3D positions which correspond to the pixels of the cylindrical image (please refer to Figure 2 for an illustration). One can choose a reasonable number of points lying on the pipe perimeter, depending on the resolution of the source images. If square pixels are assumed, the spatial resolution along the pipe perimeter also defines the resolution along the pipe axis. Every point p can then be projected into the corresponding fisheye image to acquire the color information at the location I(u, v). Usually, the projection will hit the image at subpixel positions, so image interpolation is needed.
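The sampling and projection steps above can be sketched as follows. This is a minimal illustration only: it assumes an equidistant fisheye model (image radius proportional to the angle of incidence, so that half the field of view maps to D/2), which may differ from the lens equation used in this work, and the function names `cylinder_point`, `fisheye_project` and `bilinear` are hypothetical. Points passed to `fisheye_project` are expected in the camera frame, with the optical axis along z.

```python
import math

def cylinder_point(col, row, n_cols, radius, step):
    """3D point on the pipe wall for pixel (row, col) of the unwrapped image.
    Columns run around the perimeter, rows along the pipe axis (z)."""
    phi = 2.0 * math.pi * col / n_cols
    return (radius * math.cos(phi), radius * math.sin(phi), row * step)

def fisheye_project(p, fov_deg, diameter_px, center):
    """Project a camera-frame point with an ASSUMED equidistant model:
    d = D * theta / fov, where theta is the angle of incidence."""
    x, y, z = p
    theta = math.atan2(math.hypot(x, y), z)   # angle to the optical axis
    d = diameter_px * theta / math.radians(fov_deg)
    psi = math.atan2(y, x)                     # direction in the image plane
    return (center[0] + d * math.cos(psi), center[1] + d * math.sin(psi))

def bilinear(img, u, v):
    """Bilinear interpolation of img[row][col] at subpixel (u=col, v=row)."""
    c0, r0 = int(math.floor(u)), int(math.floor(v))
    fu, fv = u - c0, v - r0
    c1 = min(c0 + 1, len(img[0]) - 1)
    r1 = min(r0 + 1, len(img) - 1)
    top = (1 - fu) * img[r0][c0] + fu * img[r0][c1]
    bot = (1 - fu) * img[r1][c0] + fu * img[r1][c1]
    return (1 - fv) * top + fv * bot
```

With square pixels, the axial step would be chosen as `2 * math.pi * radius / n_cols`, so that one pixel covers the same length along the perimeter and along the pipe axis.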
Subsequently, the cylindrical unwrap for every fisheye image can be calculated for a given camera pose. In the existing commercial system, the camera is assumed to lie on and move along the pipe axis. This assumption is not valid due to the movement between the shots – major stitching artifacts are the consequence, making it nearly impossible to use these images for training of neural networks for automatic damage detection.
2.2. Camera Pose Estimation
We chose a feature-based approach for the estimation of the camera pose. In , features are detected and filtered based on the images generated by back-projection. Due to the unknown camera pose, fairly strong artifacts may occur, causing fewer features or false matches. Due to the relatively homogeneous pipe surface and the limited image quality, we use an iterative feature matching scheme  that considers neighborhood constraints and yields more numerous and more robust feature correspondences.
Figure 2: Scheme illustrating the back-projection of the point on the image plane, resulting in point . Interpolation is done at to retrieve the color for .
Local Pose Optimization Utilizing (1), describing the characteristics of the fisheye lens, a vector
can be constructed for every feature point to describe the direction of the corresponding line of sight. With respect to a global coordinate system placed on the symmetric axis of the pipe, the position and orientation of the robot camera can be specified by a translation vector t and a rotation matrix R, respectively. For this pose, the straight-line equation
represents the line of sight originating from the camera center, depending on the camera position. Based on (3), the intersections of the lines with the cylinder surface of the pipe can be calculated. For matched feature points, the intersections must be consistent on the 3d surface.
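The intersection of a line of sight with the pipe surface can be illustrated with a short sketch. It assumes the line has the form t + λ·R·v, with v the viewing direction in the camera frame, and that the pipe is the cylinder x² + y² = r² around the z-axis of the global coordinate system; the function names are hypothetical and the closed-form quadratic is used here instead of the linearized form derived in the text.

```python
import math

def mat_vec(R, v):
    """Multiply a 3x3 matrix (list of rows) with a 3-vector."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))

def intersect_pipe(t, R, v, radius):
    """Intersect the line of sight t + lam * R v with x^2 + y^2 = radius^2.
    Returns the 3D intersection in front of the camera, or None."""
    d = mat_vec(R, v)
    a = d[0] ** 2 + d[1] ** 2
    if a == 0.0:
        return None                       # ray parallel to the pipe axis
    b = 2.0 * (t[0] * d[0] + t[1] * d[1])
    c = t[0] ** 2 + t[1] ** 2 - radius ** 2
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None                       # no real intersection
    lam = (-b + math.sqrt(disc)) / (2.0 * a)   # larger root of the quadratic
    if lam <= 0.0:
        return None
    return tuple(t[i] + lam * d[i] for i in range(3))
```

For a camera inside the pipe, the two roots of the quadratic have opposite signs, so the larger root is the intersection in viewing direction. For matched features in two frames, the two intersection points obtained this way should coincide when both poses are correct.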
Our objective is to combine the equations in a linear equation system, which can be solved efficiently for the translational () and rotational () updates. We linearize (3) with respect to the rotation angles
around an operating point of t and R, resulting in three equations, one for each component of g. With the circle equation (with r being the radius of the pipe) and the x- and y-component of the linearized straight-line equation (4), the function for specifying the intersection point can be created. So the location of a certain point on the pipe surface
can be specified by
for the first dimension. The equations for the remaining dimensions have the same structure but with and , respectively.
In our local pose estimation scheme, camera location and orientation are determined from point correspondences between two successive camera frames with unknown pose. Therefore, we always estimate pairs of 3D camera poses corresponding to the two frames. To avoid the pose ambiguity of the cylindrical shape, the first camera of each pair has to be fixed in its position along the z-axis and in its rotation around it, resulting in 4 + 6 = 10 unknown pose parameters. Based on the partial derivatives with respect to the parameters around an initial operating point, a linear equation system is created and solved for every camera pair. This procedure is applied iteratively to remove errors due to linearization.
After each iteration, the parameter vector must be updated which results in a simple addition for the translation parameters. The updated rotation matrix is calculated by .
During the iterations, we utilize the RANSAC algorithm to reject outliers. This makes the pose estimation more robust against e.g. connecting pipes or dangling roots, which introduce many features violating the assumption of a cylindrically shaped pipe with a known diameter. Otherwise, the pose estimation would likely fail at these points.
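The outlier rejection can be illustrated with a generic RANSAC loop. The sketch below is not the pose estimator of this work: it takes the model-fitting and residual functions as arguments and is demonstrated on a toy line-fitting problem, where the outlying points play the role of features on connecting pipes or roots that violate the model. All names and parameter values are hypothetical.

```python
import random

def ransac(data, fit, residual, n_min, threshold, iterations=200, seed=0):
    """Generic RANSAC: fit on random minimal samples, keep the model with
    the most inliers, then refit on all inliers of that model."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        model = fit(rng.sample(data, n_min))
        if model is None:
            continue                      # degenerate sample
        inliers = [d for d in data if residual(model, d) < threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    if best_model is not None:
        refit = fit(best_inliers)
        if refit is not None:
            best_model = refit
    return best_model, best_inliers

def fit_line(points):
    """Least-squares line y = slope * x + intercept (toy model)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    if sxx == 0.0:
        return None
    slope = sum((x - mx) * (y - my) for x, y in points) / sxx
    return slope, my - slope * mx

def line_residual(model, point):
    """Vertical distance of a point from the line."""
    slope, intercept = model
    return abs(point[1] - (slope * point[0] + intercept))
```

In the pose estimation setting, `fit` would solve the linearized equation system for a minimal set of correspondences and `residual` would measure the inconsistency of a matched feature pair on the pipe surface.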
Global Pose Optimization In the local optimization used for initialization, the camera pose is estimated independently for each pair of frames. To get a smooth camera path, a global optimization of the camera poses is performed, since the pose of the current frame directly depends on the pose of the previous frame. Every camera but the first therefore has six degrees of freedom, and its pose is determined by the locations of the matched features in both pairs the image is involved in. The first frame has only four degrees of freedom due to the ambiguity mentioned above. The matrices for K frames are combined into one big sparse matrix with 4 + 6(K − 1) = 6K − 2 unknowns, for which the linear equation system is solved iteratively. As a result, all camera parameters are connected throughout the whole camera path and can influence each other during the optimization. Thanks to the initialization with the results of the local estimation and the sparse nature of the system, it can be solved quickly. With the estimated camera poses in place, the back-projection of the unwrapped image can be calculated without artifacts caused by unconsidered camera movement.
Figure 3: Single unwrapped fisheye image after camera pose estimation. There is a clearly visible light falloff with increasing distance to the camera.
2.3. Image Enhancement and Stitching
Despite the removal of geometric artifacts, caused by the camera motion, there are still major artifacts caused by the uneven lighting of the images. The radial light falloff is easily noticeable in Figure 1 and Figure 3 causing a leap in lighting from dark to bright at the seam between two images. To get a smooth imperceptible transition between consecutive images, we apply these three steps:
1. Estimation and elimination of uneven lighting.
2. Identification of the optimal seam using Dynamic Programming.
3. Application of Poisson Blending in the transition zone.
Elimination of uneven lighting In contrast to , the illumination is modeled separately for each image. The authors of  assume that the average grey level of a distinct pixel can be regarded as the illumination intensity. This assumption is only valid if the reflection properties and color remain the same over the entire length of the pipe. In our use case, though, we have to deal with changing materials and with deposits on the surface that affect the reflection of light. Therefore, we model the lighting falloff for each image separately as a linear function of the distance to the camera.
The lighting estimation is done on the grey-scale images G(u, v) in several steps. To separate the low frequencies, a fairly strong Gaussian filter is applied. This practice helps to reduce the influence of locally strong reflections or dark spots, like pipe connections. After that, a linear function is fitted to every image column v. We use a robust regression method by fitting a linear function iteratively to the image while trimming out image areas with the biggest residuals. In every image column v, we also calculate the median value from all grey values to get an estimate for the offset. We then adjust every channel of I with
getting the new corrected image . This step is applied to all three color channels of I(u, v).
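The robust fit of the lighting falloff can be sketched for a single image column as follows. The trimming fraction, the number of rounds and, in particular, the final correction (rescaling each value by the column median divided by the fitted falloff) are assumptions made for illustration; they are not taken from the correction equation of this work. The preceding Gaussian smoothing is omitted.

```python
import statistics

def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def trimmed_linear_fit(values, keep=0.8, rounds=5):
    """Iteratively refit a line, keeping only the samples with the
    smallest residuals (trims reflections and dark spots)."""
    idx = list(range(len(values)))
    slope, icpt = fit_line(idx, values)
    n_keep = max(2, int(keep * len(values)))
    for _ in range(rounds):
        ranked = sorted(idx, key=lambda i: abs(values[i] - (slope * i + icpt)))
        kept = ranked[:n_keep]
        slope, icpt = fit_line(kept, [values[i] for i in kept])
    return slope, icpt

def flatten_column(values, keep=0.8):
    """ASSUMED correction: divide by the fitted falloff and rescale
    to the median grey value of the column."""
    slope, icpt = trimmed_linear_fit(values, keep)
    med = statistics.median(values)
    return [v * med / max(slope * i + icpt, 1e-6) for i, v in enumerate(values)]
```

A column whose grey values follow the linear falloff exactly is mapped to a constant value, while a local highlight remains visible as a deviation from that constant.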
Optimal Seam To obtain a smooth transition between consecutive images, it is advantageous to place the image seam where the image difference is small. For that purpose, an optimal path is computed by dynamic programming. With two consecutive images and overlapping along the pipe direction within the regions and , we use the normalized, absolute difference
for the grey-values as error criterion.
To avoid a frayed seam, we add an additional cost to penalize a transition from one possible seam element s(u, v) to one in the next column s(u, v + 1) by , favoring horizontal cuts through the images. For every element in the current column with index v the cost for getting there is calculated by
with being the height of the difference image. The weights and control the influence of the difference image and the vertical distance respectively. With the optimal seam at hand, the transition between two consecutive images takes place where their difference is minimal, while areas with tiny registration errors get excised.
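A dynamic-programming seam search of this kind can be sketched as follows. The cost used here, the weighted difference value plus the cheapest predecessor in the previous column plus a weighted vertical step normalized by the image height, is an assumed form that follows the description above rather than the exact equation of this work; `w_diff` and `w_step` are hypothetical names for the two weights.

```python
def optimal_seam(diff, w_diff=1.0, w_step=1.0):
    """Return one row index per column of the difference image diff[u][v],
    minimizing accumulated difference plus a penalty on vertical jumps."""
    h, w = len(diff), len(diff[0])
    cost = [[0.0] * w for _ in range(h)]
    back = [[0] * w for _ in range(h)]
    for u in range(h):
        cost[u][0] = w_diff * diff[u][0]
    for v in range(1, w):
        for u in range(h):
            # cheapest predecessor in the previous column, including jump cost
            k = min(range(h),
                    key=lambda j: cost[j][v - 1] + w_step * abs(u - j) / h)
            back[u][v] = k
            cost[u][v] = (w_diff * diff[u][v] + cost[k][v - 1]
                          + w_step * abs(u - k) / h)
    # backtrack from the cheapest element of the last column
    u = min(range(h), key=lambda j: cost[j][w - 1])
    seam = [u]
    for v in range(w - 1, 0, -1):
        u = back[u][v]
        seam.append(u)
    seam.reverse()
    return seam
```

This direct formulation needs on the order of h²·w operations; restricting the allowed jump to a few rows per column would reduce this considerably.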
Poisson Blending As the final step of the image refinement, we apply Poisson Blending  along the image seams. The blending is formulated as a least-squares problem, constrained by the previous image on one side and by the current image on the other. Thereby, the image gradients are preserved, but leaps are avoided. We limit the range of the blending to only a few pixels to save computation time.
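To illustrate the least-squares formulation, the following sketch solves a one-dimensional version of the problem for a single line of pixels crossing the seam: the unknown values in the transition zone should reproduce given target gradients while being fixed to the previous image on one side and the current image on the other. The reduction to 1D and the function name are simplifications chosen for this example; the resulting normal equations are tridiagonal and are solved with the Thomas algorithm.

```python
def poisson_blend_1d(left, right, grads):
    """Minimize sum_i (f[i+1] - f[i] - grads[i])^2 with fixed boundary
    values f[0] = left and f[n+1] = right; returns the n interior values.
    grads has n + 1 entries (one per neighbouring pair)."""
    n = len(grads) - 1
    if n < 1:
        return []
    # normal equations: 2 f[i] - f[i-1] - f[i+1] = grads[i-1] - grads[i]
    rhs = [grads[i] - grads[i + 1] for i in range(n)]
    rhs[0] += left
    rhs[-1] += right
    # Thomas algorithm for the tridiagonal system (-1, 2, -1)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = -0.5
    d[0] = rhs[0] / 2.0
    for i in range(1, n):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    f = [0.0] * n
    f[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        f[i] = d[i] - c[i] * f[i + 1]
    return f
```

With zero target gradients the result is a linear ramp between the two boundary values; with gradients taken from one of the images, its texture is kept while the brightness leap is distributed over the transition zone.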
Automatic annotation and damage classification of the enhanced images is treated as a semantic labeling problem, which is tackled using deep convolutional neural networks. As mentioned in section 1, many algorithms rely on image processing approaches and heuristics to detect defects in sewer pipes. Although those algorithms are able to detect some defects reliably, a common problem is the relatively low amount and quality of the available data. On the one hand, this constrains the number of detectable defect classes due to the lack of examples per class; on the other hand, the varying visual appearance of those classes makes it nearly impossible to find a single general detection method from so few examples.
We think that given a sufficient amount of high quality labeled data, a deep convolutional neural network can learn to detect and differentiate between a variety of different classes with almost expert-like accuracy.
3.1. Data Acquisition
In order to successfully train deep neural networks, large amounts of data are needed. Additionally in the case of semantic labeling, this data must also be labeled in a pixel-precise way, meaning that each pixel must be assigned to one specific class.
Given the enhanced, unrolled image produced by the method in section 2 and the expertise of specially trained experts, we were able to produce such pixel-precise labeling for 111 sewer pipes covering almost 4.6 kilometers. We selected the pipes to represent a variety of materials (61 stoneware and 50 concrete) and diameters (200 to 500 millimeters). The number of pixels representing the circumference was set to 1200 and the resulting image resolution was computed accordingly. Overall, we manually annotated 1200x7123693 pixels of unwrapped pipe images. Since each single image is too large to be used for training as a whole, we decided to split them into equally sized, overlapping chunks of size 600x1200.
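The relation between pipe diameter and pixel size, and the splitting into overlapping chunks, can be sketched as follows. The chunk length of 600 pixels along the pipe axis and the 1200 pixels around the circumference are taken from the text; the stride of 300 pixels (50% overlap) is a hypothetical value, since the amount of overlap is not stated.

```python
import math

def mm_per_pixel(diameter_mm, n_circumference=1200):
    """Size of one unwrapped pixel if the circumference maps to
    n_circumference square pixels."""
    return math.pi * diameter_mm / n_circumference

def chunk_starts(length, size=600, stride=300):
    """Start positions (along the pipe axis) of overlapping chunks that
    cover an image of the given length; the last chunk is aligned to the end."""
    if length <= size:
        return [0]
    starts = list(range(0, length - size + 1, stride))
    if starts[-1] != length - size:
        starts.append(length - size)
    return starts
```

For a 200 mm pipe, this gives roughly 0.52 mm per pixel, and for a 500 mm pipe roughly 1.31 mm per pixel.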
For the annotation of the images, we selected some of the most common defects as well as structural elements. Overall, 9 classes were used. Regarding defects the classes are residue, crack, root, obstacle and erosion/spalling, whereas for the structural elements we used joint, connection and shaft. A labeled example image can be seen in Figure 4.
Figure 4: Pixel-precise labeling (b) of a sewer pipe (a). Classes are: root (yellow), crack (magenta), residue (blue), spalling (orange), connection (green), joint (red).
3.2. Network Topology
The network structure we used to perform semantic labeling is based on the Full-Resolution Residual Networks (FRRN) by Pohlen et al. . In their work, the authors develop a novel topology of deep convolutional neural network aimed at semantic labeling tasks.
The principal idea of this structure is to have two data streams through the whole network. One stream is responsible for object recognition and undergoes a classical pipeline of feature extraction and down-sampling to learn robust features, whereas the other stream is kept at full input resolution to learn features for precise object segmentation. Features learned by the recognition (pooling) stream are successively up-sampled and fused into the segmentation (residual) stream. This way, even more complex features are generated for the final pixel-wise labeling. The general structure of an FRRN can be seen in Figure 5. We adapted the original FRRN structure to better fit our problem and to reduce complexity, computation time and model size. Compared to the FRRN A structure in , we changed the number of full-resolution residual units (FRRU) per resolution level to 3 and the number of filters to 24 for the pooling stream and 16 for the residual stream.
Table 1: Number of objects and number of pixels within each class. The last column shows the fraction of the class specific pixels to the number of pixels overall.
Figure 5: General structure of an FRRN. The recognition stream (red) undergoes down-sampling and feature extraction, whereas the full resolution residual stream (blue) collects all subsequent features. Taken from .
Figure 6: Parts of an unwrapped pipe surface with associated camera motion paths used for back-projection. Top: commercial sewer pipe inspection software, which assumes a camera path along the pipe axis. Bottom: same section processed with our system. The camera path was estimated as described, resulting in a motion-artifact-free pipe unwrap.
For training, the data generated in section 3.1 was split into a training and a test set using a ratio of 80:20. Due to the large image size, training was performed on downscaled versions of the data with a size of 256x512. This enabled us to decrease training time at no cost in terms of quality compared to the original resolution. Furthermore, the images were converted to YCbCr space and contrast normalized in a windowed fashion to compensate for the varying materials and color (e.g. of residues) and bright reflections due to wet surfaces.
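The color conversion and windowed normalization can be sketched as follows. The conversion uses the full-range BT.601 coefficients known from JPEG, and the normalization subtracts the local mean and divides by the local standard deviation within a square window; both choices, as well as the window size, are assumptions made for illustration, since the exact variant used for training is not specified.

```python
import math

def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JPEG) conversion for 8-bit RGB values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def local_contrast_normalize(channel, half=1, eps=1e-6):
    """Subtract the mean and divide by the standard deviation of a
    (2*half+1)^2 window around each pixel (clipped at the borders)."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            win = [channel[i][j]
                   for i in range(max(0, r - half), min(h, r + half + 1))
                   for j in range(max(0, c - half), min(w, c + half + 1))]
            mean = sum(win) / len(win)
            var = sum((x - mean) ** 2 for x in win) / len(win)
            out[r][c] = (channel[r][c] - mean) / (math.sqrt(var) + eps)
    return out
```

Because the statistics are computed locally, a uniformly bright or dark region is mapped to values near zero, which reduces the influence of the pipe material on the network input.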
To further increase the amount of training data available, minor data augmentation was applied. Since sewer pipe inspection is direction independent and the pipes are symmetrical, we randomly flipped the input images either vertically or horizontally with a probability of .
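A flip augmentation of this kind can be sketched as follows. The flip probability `p` is left as a parameter, and the equal chance of choosing a vertical or a horizontal flip is an assumption made for this example. The label image must be flipped in the same way as the input so that the pixel-precise annotation stays aligned.

```python
import random

def random_flip(image, labels, p, rng=random):
    """With probability p, flip image and label map together, either
    vertically (row order) or horizontally (column order)."""
    if rng.random() < p:
        if rng.random() < 0.5:            # assumed 50/50 choice of direction
            image, labels = image[::-1], labels[::-1]
        else:
            image = [row[::-1] for row in image]
            labels = [row[::-1] for row in labels]
    return image, labels
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible.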
Training was performed on a single Nvidia Titan X for 72 hours using TensorFlow. We optimized a bootstrapped cross entropy as introduced in . The idea is to take only a certain percentage p of the pixels into account, namely those that are misclassified or correctly classified with a low class probability
where is the posterior probability for image pixel i and its corresponding target class and is a threshold chosen so that K elements fall below this. We selected with p = 0.1 where N is the total number of image pixels.
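The bootstrapping idea can be sketched as follows. The function receives, for every pixel, the predicted posterior probability of its target class and averages the negative log-likelihood over the K pixels with the lowest probability only. Setting K to the fraction p of all N pixels is an assumption consistent with the description above; the function name is hypothetical.

```python
import math

def bootstrapped_cross_entropy(target_probs, fraction=0.1):
    """Mean negative log-probability over the K hardest pixels, where
    K = fraction * N (at least 1) and 'hardest' means lowest posterior
    probability for the target class."""
    n = len(target_probs)
    k = max(1, int(fraction * n))
    hardest = sorted(target_probs)[:k]
    return -sum(math.log(max(p, 1e-12)) for p in hardest) / k
```

Pixels that are already classified correctly with high confidence, which is the case for most background pixels, therefore do not contribute to the loss.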
Training was performed using the Adam optimizer  with a constant learning rate of . In Figure 7, the evolution of accuracy and mean-IoU can be seen. As expected, the accuracy increases rapidly because most of the images contain a large portion of background.
In this section, we first show results of our system for the generation of images depicting unwrapped pipe surfaces.
Figure 7: Evolution of accuracy and mean-IoU over training iterations on the training and validation sets.
We compare the results to images produced by a commercial sewer pipe inspection software. In the second part, we present the results and benchmarks for our damage detection algorithm.
4.1. Results – Unrolled Pipes
The images of Figure 6 illustrate the influence of the motion path estimation. A pipe section is shown on the left side, generated with the commercial sewer pipe inspection software in Figure 6a and with our system in Figure 6c. The images on the right show the camera motion paths used for the calculation of the back-projection. As depicted in Figure 6b, a camera motion along the pipe axis was assumed for the generation of Figure 6a. Therefore, the actual camera movement causes several artifacts in the final image. Foremost, it creates the illusion of a bent pipe. An off-center camera is also easy to notice from the wavy pipe couplings. In Figure 8, details of the pipe couplings are shown for comparison.
Figure 8: Detail views for the pipe couplings (cropped and rotated from Figure 6). Top: Typical artifact due to an incorrect camera pose estimation by the commercial sewer pipe inspection software. Bottom: The distortion free pipe coupling as a result of the camera pose estimation.
The images in Figure 9 show close up views of a pipe surface region with many cracks. In the left image, some cracks appear twice showing ghosting effects due to the lack of camera pose information. With our estimation of the camera positions, these artifacts disappear (right). Furthermore, small registration errors get excised by the estimation of the optimal path. In addition, our algorithm led to a noticeable improvement of the image resolution and contrast.
Figure 9: Detail views of cracks cropped from Figure 6. Left: Due to small registration errors some cracks appear twice and blurry. Right: The cracks are correctly registered due to camera pose and optimal path estimation.
These changes can also be noticed in the right image of Figure 10, where deposits at the pipe bottom can be inspected in much more detail. In the left image of Figure 10, produced by the commercial system, these details are lost due to the low resolution.
4.2. Results – Automatic Annotation
In Figure 11, Figure 12 and Figure 13, some exemplary labelings produced by the presented system are shown. It can be seen that structural elements can be detected and classified reliably, regardless of the pipe’s material. For defects, however, there are some differences among the pipe types. Depending on the material of the pipe, some defects are more likely to be missed. In Figure 11 and Figure 12, cracks and roots are detected reasonably well and classified as such, whereas in Figure 13 the crack is missed completely. Also, cracks and roots, although detected, are sometimes mistaken for each other due to their similar color. All in all, our system produces visually satisfying results.
Figure 10: Detail views of the pipe bottom cropped from Figure 6. Left: Result of the commercial system with strong blurring and limited contrast. Right: The result of our system with enhanced details and image contrast.
Figure 11: Labeling of a stoneware pipe. Joints, cracks and roots are reasonably detected, whereas spalling is missed. Also some dark areas are confused with joints.
Figure 12: Labeling of a stoneware pipe. Cracks are generally easily detected. Still some fine roots are mistaken for cracks due to their dark color. Connections are generally no problem.
Table 2: Confusion matrix representing accuracy on a pixel level on the test set. The last column represents the mean-IoU per class averaged over all images.
Figure 13: Labeling of a concrete pipe. Due to the high roughness of the surface, cracks can easily be missed.
Table 2 shows the confusion matrix and mean-IoU on the test set and gives a more detailed overview of the results. It can be seen that the most problematic class is obstacle (seventh line), which most often gets confused with the background. This makes sense in that usually an obstacle that is viewed from the center of the pipe like on the unwrapped images is virtually invisible and cannot be distinguished from the background. Only in cases where there are also color changes, the obstacle can be detected. The second most problematic class is crack (fifth line) which also gets missed often, due to the complex texture of the concrete pipes. An example can be seen in Figure 13.
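For reference, the relation between a pixel-level confusion matrix and the intersection over union (IoU) of a class can be sketched as follows. The sketch assumes that `conf[i][j]` counts the pixels of true class i that were predicted as class j, and it computes the IoU over all pixels at once, whereas the values reported in Table 2 are averaged over images.

```python
def per_class_iou(conf):
    """IoU per class from a square confusion matrix:
    TP / (TP + FP + FN), or NaN if the class never occurs."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                         # missed pixels of class c
        fp = sum(conf[r][c] for r in range(n)) - tp    # pixels wrongly labeled c
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious

def per_class_recall(conf):
    """Fraction of the pixels of each true class that was labeled correctly."""
    return [row[c] / sum(row) if sum(row) else float("nan")
            for c, row in enumerate(conf)]
```

A class that is mostly confused with the background, such as obstacle, shows up with a low recall in its row and consequently a low IoU.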
In general, it seems that when a defect or structure is missed, the problem is most often not that it gets misclassified but rather that it is missed altogether and mistaken for background. This could be a serious problem in terms of risk assessment, which we plan to examine more closely in future work (see section 5).
Until now, we have only used a state-of-the-art network topology that shows generally good performance in semantic segmentation tasks. We plan to enhance the structure to incorporate the shape of a pipe, in the sense that the images wrap around and that the first and last rows are correlated. This way we hope to reduce the number of parameters needed while keeping the quality of the results. This in turn would reduce computation time and model size. Furthermore, we aim at increasing the resolution used for training and prediction to overcome the problem of missing thin cracks due to down-sampling.
Although mean-IoU is a good measure to get an intuition for the quality of the system, it is in a sense very academic. In terms of risk assessment and sanitation planning, detection rate and false positive rate are perhaps the most interesting measures. We plan to also evaluate our results based on those measures and to give a qualitative evaluation (e.g. in terms of risk) with the help of experts.
Regarding the image enhancement side of this work, there are two possible ways to go. On the one hand we could use greatly improved images, which would lead to trackable features. On the other hand, one can use stereo optical techniques to retrieve a 3d model of the pipe. In addition to the used defects mentioned in section 3.1, this would enable us to also detect open joints which can be seen as high risk defects due to their influence on the pipe’s structural integrity. Furthermore, it would likely improve the detection of obstacles for the reasons mentioned in section 4.2.
In this paper, we present a method for enhancing low-quality fisheye images of sewer pipes and producing high-quality unwraps that are then used for the automatic detection and classification of defects and structural elements. We show that, given a sufficient amount of data, a single system can be capable of detecting a wide range of defects despite the large visual variations. Although there are still some problems to tackle, the system achieves good results in terms of accuracy and IoU.
[1] D. Cooper, T. P. Pridmore, and N. Taylor. Towards the recovery of extrinsic camera parameters from video records of sewer surveys. Machine Vision and Applications, 11(2):53–63, 1998.
[2] S. Esquivel, R. Koch, and H. Rehse. Reconstruction of sewer shaft profiles from fisheye-lens camera images. In Proc. DAGM, pages 332–341, 2009.
[3] S. Esquivel, R. Koch, and H. Rehse. Time budget evaluation for image-based reconstruction of sewer shafts. In Proc. SPIE 77240, Real-Time Image and Video Processing, 2010.
[4] J. Furch and P. Eisert. An iterative method for improving feature matches. In 2013 International Conference on 3D Vision, pages 406–413. IEEE, 2013.
[5] P. Huynh, R. Ross, A. Martchenko, and J. Devlin. Dou-edge evaluation algorithm for automatic thin crack detection in pipelines. In 2015 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), pages 191–196, Oct 2015.
[6] J. Kannala, S. S. Brandt, and J. Heikkilä. Measuring and modelling sewer pipes from video. Machine Vision and Applications, 19(2):73–83, 2008.
[7] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[8] P. Pérez, M. Gangnet, and A. Blake. Poisson image editing. In ACM SIGGRAPH 2003 Papers, pages 313–318, 2003.
[9] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[10] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3D. In ACM SIGGRAPH, pages 835–846, 2006.
[11] T.-C. Su. Segmentation of crack and open joint in sewer pipelines based on CCTV inspection images. In Proc. AASRI Int. Conf. on Circuits and Systems, May 2015.
[12] C. Wu. Towards linear-time incremental structure from motion. In Proc. Int. Conf. on 3D Vision (3DV), pages 127–134, 2013.
[13] C. Wu, S. Agarwal, B. Curless, and S. M. Seitz. Multicore bundle adjustment. In Proc. Int. Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3057–3064, 2011.
[14] W. Wu, Z. Liu, and Y. He. Classification of defects with ensemble methods in the automated visual inspection of sewer pipes. Pattern Analysis and Applications, 18(2):263–276, 2015.
[15] Z. Wu, C. Shen, and A. van den Hengel. Bridging category-level and instance-level semantic image segmentation. CoRR, abs/1605.06885, 2016.
[16] M.-D. Yang and T.-C. Su. Automated diagnosis of sewer pipe defects based on machine learning approaches. Expert Systems with Applications, 35(3):1327–1337, 2008.