2.1. Data Summary
The data used to demonstrate the concepts and methods detailed in this work consist of images of corn samples harvested over three separate years from typical fields in the Midwest (Table 1). Each image shows a different sample as harvested by a grain combine. Such data contain real-world variation in the material, such as color, size, shape, and orientation, as well as contaminants such as leaves, twigs, broken, rotten, or cob pieces, and chaff. Any material other than clean, intact (unbroken) corn kernels is generally referred to as MOG (material other than grain) and represents a quality concern. In harvesting applications, knowledge of quality of the kind shown in this work presents an opportunity for feedback control and, ultimately, intelligent harvesting decisions leading to autonomy of the machine. Four separate camera systems were used to collect the images in the datasets. The systems were intended to be identical, but some variation between them is expected due to tolerances in the lighting source and the camera hardware. The training and validation data are from 2017 and 2018, and the test data are from 2015. The original (full-scale) size of the images is 460x640x3.
Table 1: Relevant metadata for the datasets used in this work. Year and # cameras indicate the diversity of factors from which the data originate.
2.2. BNAF vs. MAF
While Block Neural Autoregressive Flows (BNAF) (De Cao, Aziz, & Titov, 2019) and Masked Autoregressive Flows (MAF) (Papamakarios, Pavlakou, & Murray, 2017) are both normalizing flows, they differ significantly, and each has its own advantages and drawbacks. MAF is essentially a stack of MADEs formed to increase modeling capacity. It takes direct advantage of the fact that MADE (and the probability chain rule) imposes a lower-triangular dependence structure, so the Jacobian of each flow is lower triangular and its determinant, which is required by the change of variables procedure, is simply the product of the diagonal entries. In the case of MAF this is trivial to compute, since each flow is typically just an affine transformation of the input random variables (RVs), albeit one whose scale and bias for each conditioned variable are complex functions of the conditioning variables. Because the MADE architecture can be thought of as reusing the transformations of the conditioning variables for all downstream conditioned variables, MAF is a relatively efficient parameterization. BNAF, on the other hand, trades parameter efficiency for flexibility while still retaining a lower-triangular dependence structure for ease of calculating the Jacobian determinant. In each flow, the transformation of each variable is an unrestricted dense neural network function of the conditioning variables and a monotonic neural network function of the variable itself. Thus, each variable in each flow has its own transformation, which is by design less parameter efficient than MAF. While flexible, this parameterization makes BNAF difficult to employ even with small images like CIFAR10 (32x32x3), since a typical high-end GPU (~12GB VRAM) does not have enough memory. In practice this requires some kind of dimensionality reduction. Conveniently, linear techniques (e.g. singular value decomposition) were found to help the optimization enough to provide a net gain in anomaly detection, even though some information is discarded in the process (Just & Ghosal, 2019).
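The triangular-Jacobian argument for MAF can be checked numerically. The sketch below is only an illustration under stated assumptions, not the implementation used in this work: a single affine autoregressive step with a toy linear conditioner, where the strictly lower-triangular matrices `w_mu` and `w_alpha` are hypothetical stand-ins for a MADE network. It compares the analytic log-determinant (the sum of the log diagonal entries) with a brute-force determinant of a finite-difference Jacobian.

```python
import numpy as np

def affine_autoregressive_flow(x, w_mu, w_alpha):
    """One MAF-style affine flow step with a toy linear conditioner.

    w_mu and w_alpha are strictly lower triangular (an assumption of this
    sketch), so the shift and log-scale of variable i depend only on
    variables 0..i-1.
    """
    mu = w_mu @ x                    # shift for each conditioned variable
    alpha = np.tanh(w_alpha @ x)     # log-scale, bounded for stability
    u = (x - mu) * np.exp(-alpha)
    log_det = -np.sum(alpha)         # log of the product of diagonal entries
    return u, log_det

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x, used for checking only."""
    d = x.size
    jac = np.zeros((d, d))
    for j in range(d):
        step = np.zeros(d)
        step[j] = eps
        jac[:, j] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return jac

rng = np.random.default_rng(0)
dim = 5
w_mu = np.tril(rng.normal(size=(dim, dim)), k=-1)
w_alpha = np.tril(rng.normal(size=(dim, dim)), k=-1)
x = rng.normal(size=dim)

u, log_det = affine_autoregressive_flow(x, w_mu, w_alpha)

# Brute-force check: determinant of the full Jacobian.
jac = numerical_jacobian(
    lambda v: affine_autoregressive_flow(v, w_mu, w_alpha)[0], x)
sign, brute_force_log_det = np.linalg.slogdet(jac)

# Change of variables with a standard normal base density.
log_px = -0.5 * (u @ u + dim * np.log(2.0 * np.pi)) + log_det
```

The analytic value costs a single sum over the dimensions, whereas the brute-force determinant scales cubically, which is the reason the triangular structure matters for high-dimensional inputs.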
2.3. Modeling Approach
In order to train a model that could produce a quality (anomaly) heatmap over the image, the density model was built on smaller crops from the image, as depicted in Figure 1. The trained model could then be swept over the image in the same way a windowed filter would be, producing a heatmap of log-likelihood (LL). Several experiments were performed that trained both BNAF and MAF models and examined the heatmaps qualitatively (since no annotation was available). The experiments included image crops from full-resolution and reduced (to 25% of original size) images, as described in Table 2. When using the full-sized image, even though the crop is relatively small compared to objects in the image, at 46x46x3 (6348 dimensions) the dimensionality is very large for a typical dense-type neural model like the ones used herein. Although MAF could technically be used effectively at this level thanks to the efficient parameter reuse of the MADE architecture it leverages, as was done with CIFAR10 in (Just & Ghosal, 2019), BNAF did not scale as well. Instead of severely restricting the network architectures so that the entire model could be trained on a typical GPU (12GB VRAM), the procedures of (Just & Ghosal, 2019) were followed to both reduce dimensionality and achieve a potentially better optimized result. The dimensionality was reduced via SVD to 100 components for the full-sized images, while the reduced-size images could be modeled in full dimension (363). The number of components was not identified as ideal through extensive tuning; it had simply worked well in previous experience and produced results good enough to prove the concepts in this work. Such factors should be extensively explored and tuned prior to deployment in any application. The LL scores are not published, since they are in effect meaningless for the purposes of this work due to their known lack of correlation with anomaly detection performance (Just & Ghosal, 2019).
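The SVD-based reduction from 6348 dimensions to 100 components can be sketched with NumPy as follows. This is a minimal illustration rather than the exact procedure of (Just & Ghosal, 2019): centering the data before the decomposition, the function names, and the number of crops are assumptions made here, and random values stand in for real image crops.

```python
import numpy as np

def fit_svd_projection(flat_crops, n_components):
    """Return the mean and the top right-singular vectors of centered data."""
    mean = flat_crops.mean(axis=0)
    # Economy SVD: rows of vt are orthonormal directions in pixel space.
    _, _, vt = np.linalg.svd(flat_crops - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(flat_crops, mean, components):
    """Map flattened crops to their low-dimensional coordinates."""
    return (flat_crops - mean) @ components.T

rng = np.random.default_rng(0)
crop_side, channels = 46, 3
n_crops, n_components = 256, 100   # n_crops is an arbitrary choice here

# Random values stand in for real image crops (illustration only).
crops = rng.random((n_crops, crop_side, crop_side, channels))
flat = crops.reshape(n_crops, -1)            # shape (256, 6348)

mean, components = fit_svd_projection(flat, n_components)
reduced = project(flat, mean, components)    # shape (256, 100)
```

The density model would then be trained on `reduced` instead of `flat`, and the same `mean` and `components` would be reused to project any crop scored later.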
Table 2: Nominal information regarding density modeling. Both BNAF and MAF were used at two different resolution levels. Dimensionality reduction of the full-resolution random crops was necessary for BNAF, since otherwise the large number of dimensions would make training on a GPU with 12GB VRAM implausible for most architectures; it was also employed for MAF for a fair comparison.
Figure 1: A green box in (a) marks where a 46x46x3 pixel crop is taken; the cropped image is shown in (b).
Because the training procedure used random crops from the images in the training set, for which a very large number of distinct crops is possible and duplicates of the same crop are very uncommon, the validation data for early stopping simply used the same pipeline of images as the training data. This was an efficient use of the data, since the quantity of images was not large.
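A random-crop pipeline of this kind can be sketched in a few lines. The function names, batch size, and number of images below are hypothetical, and random arrays stand in for the grain images; the sketch only shows why training and early-stopping batches can share one image pool.

```python
import numpy as np

def random_crop(image, crop_side, rng):
    """Cut one square crop at a uniformly random location."""
    height, width = image.shape[:2]
    top = rng.integers(0, height - crop_side + 1)
    left = rng.integers(0, width - crop_side + 1)
    return image[top:top + crop_side, left:left + crop_side]

def crop_batch(images, crop_side, batch_size, rng):
    """Draw a batch of crops, each from a randomly chosen image."""
    picks = rng.integers(0, len(images), size=batch_size)
    return np.stack([random_crop(images[i], crop_side, rng) for i in picks])

rng = np.random.default_rng(0)
# Random arrays stand in for harvested-grain images (illustration only).
images = [rng.random((460, 640, 3)) for _ in range(4)]

train_batch = crop_batch(images, crop_side=46, batch_size=32, rng=rng)
# The early-stopping batch is drawn from the same image pool; with this
# many possible crop positions, repeated crops are rare.
val_batch = crop_batch(images, crop_side=46, batch_size=32, rng=rng)

# Number of distinct 46x46 crop positions in one 460x640 image.
positions_per_image = (460 - 46 + 1) * (640 - 46 + 1)
```

With 246,925 possible positions per image, the chance that a validation crop exactly repeats a training crop is small, which is the property the early-stopping scheme relies on.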
Code for training (in both cases the code has been built for TensorFlow 2.0):
• MAF: https://github.com/johnpjust/MAF_GQ_images_tf20
• BNAF: https://github.com/johnpjust/BNAF_GQ_images
While it is emphasized that the solution presented to the quality estimation problem is an unsupervised one, since no annotations are available, it is also recognized that the qualitative assessment performed during model selection and tuning is a form of supervision. There is no need to resolve this, since it is rare that an algorithm would be deployed without some confidence that it will succeed at the task required of it. Also, very few changes were made relative to (Just & Ghosal, 2019) to obtain the results here, which provides confidence that this strategy is fairly robust and will work well without much tuning. In order to qualitatively assess the performance, each image was overlaid with a heatmap of the LL (scaled for optimal visualization as an image) and compared with the original, in Figure 3 for the training/validation data and Figure 4 for the test data. Across MAF and BNAF and the two resolutions examined, the LL values were relatively highly correlated, indicating that the overall approach is fairly robust to these choices. The heatmaps were obtained by sweeping the 46x46x3 crop window over the full-resolution image at strides of eight pixels horizontally and vertically and calculating the LL (after reducing dimensionality to 100 components via SVD) at each location using the trained model. Dark and red colors indicate lower LL, and therefore lower quality. This produced a total of 3570 LL estimates per image using the full-resolution model. Figure 2 shows box plots of all 3570 LL values for six representative images in the train and test datasets. The images were selected by binning them by the average of the 25th and 50th percentiles of their LL values and taking a representative image from each bin, in order to observe the full range of quality found in each dataset.
Overall, the ranges of LL were very similar for the train and test sets, and the resulting quality estimates were comparable in each bin for Figure 3 and Figure 4.
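The window sweep that produces the LL heatmap can be sketched as below. This is an illustration under stated assumptions: `toy_log_likelihood` is a hypothetical placeholder for the trained flow (including its SVD projection), and a small synthetic image with one noisy patch replaces a real grain image.

```python
import numpy as np

def sweep_scores(image, window, stride, score_fn):
    """Score every window position on a regular grid over the image."""
    height, width = image.shape[:2]
    tops = range(0, height - window + 1, stride)
    lefts = range(0, width - window + 1, stride)
    scores = np.empty((len(tops), len(lefts)))
    for r, top in enumerate(tops):
        for c, left in enumerate(lefts):
            crop = image[top:top + window, left:left + window]
            scores[r, c] = score_fn(crop)
    return scores

def toy_log_likelihood(crop):
    """Placeholder for the trained flow: unit Gaussian around mid-gray."""
    return -0.5 * np.sum((crop - 0.5) ** 2)

rng = np.random.default_rng(0)
image = np.full((120, 160, 3), 0.5)              # clean synthetic background
image[40:80, 60:100] = rng.random((40, 40, 3))   # an "anomalous" patch

ll_map = sweep_scores(image, window=46, stride=8,
                      score_fn=toy_log_likelihood)
lowest = np.unravel_index(np.argmin(ll_map), ll_map.shape)
```

Each entry of `ll_map` corresponds to one window position, so the map can be upsampled and overlaid on the image; the lowest value lands on the only grid window that fully contains the synthetic patch.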
Figure 2: Per-image box plots for representative images from the train and test sets at increments of ten for binned LL levels. The binned LL corresponds to the average of the 25th and 50th percentile for each image. The corresponding images are shown in Figure 3 for train data and Figure 4 for test data.
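The per-image statistic used for binning can be written down directly. In the sketch below, the rule for choosing the representative image (the one closest to the bin center) is an assumption made for illustration, since the selection rule within a bin is not specified here; the function names and toy LL values are likewise hypothetical.

```python
import numpy as np

def image_quality_stat(ll_values):
    """Average of the 25th and 50th percentiles of one image's LL values."""
    q25, q50 = np.percentile(ll_values, [25, 50])
    return 0.5 * (q25 + q50)

def representative_per_bin(stats, bin_width):
    """For each occupied bin, pick the image closest to the bin center.

    The closest-to-center rule is an assumption of this sketch.
    """
    stats = np.asarray(stats, dtype=float)
    bin_ids = np.floor(stats / bin_width).astype(int)
    chosen = {}
    for b in np.unique(bin_ids):
        members = np.flatnonzero(bin_ids == b)
        center = (b + 0.5) * bin_width
        best = members[np.argmin(np.abs(stats[members] - center))]
        chosen[int(b)] = int(best)
    return chosen

# Toy LL values for four images (illustration only).
ll_per_image = [np.arange(1.0, 6.0) + offset
                for offset in (-40.0, -38.0, -21.0, -5.0)]
stats = [image_quality_stat(v) for v in ll_per_image]
picks = representative_per_bin(stats, bin_width=10.0)
```

Using low percentiles rather than the mean emphasizes the lower tail of the LL distribution, where the poor-quality regions of an image appear.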
The images in the lowest LL bins of the train and test sets in Figure 3 and Figure 4 look very different, but closer inspection shows similar levels of quality arising from different quality factors. In the training data the corn is unusually bright yellow and contains a large amount of broken and small pieces. Conversely, the image for the lowest bin in the test data contains a large amount of trash and immature kernels. The corresponding heatmaps are very good but not perfect. There are some instances where quality issues such as leaves or other MOG are not completely shaded in red. In other cases, parts of kernels are shaded even though the kernel does not appear to have any obvious quality issue. This may indicate potential for future improvement through further tuning and selection of the model and window size. However, these cases are not substantial, and the level of quality clearly increases with the binned values (left to right). Moreover, in some of the false positive cases (identifying low quality where none is observable) the algorithm may be finding nontrivial abnormalities in size, shape, or color that are difficult for a human to observe (i.e., there may be underlying quality factors that are not as obvious as broken kernels and MOG).
Figure 3: Training Data heat maps for the box plots in Figure 2. The top image is the original and the bottom image in each case is the same image overlaid with a semi-transparent heat map. Low LL (low quality) is denoted by darker/more red shades.
Figure 4: Test Data heat maps for the box plots in Figure 2. The top image is the original and the bottom image in each case is the same image overlaid with a semi-transparent heat map. Low LL (low quality) is denoted by darker/more red shades. Some issues, like the large brown leaf piece in (a), were not completely identified as low quality with dark red everywhere, which may indicate an opportunity to improve results through further model tuning and architecture selection.
The normalizing flow models used herein have been shown to be very good density models, but their neural architectures are highly restricted and therefore not very conducive to interpretable feature extraction. The downstream task of classifying novel/anomalous data (e.g. poor quality) can be highly useful, as in the example presented in this work, but the density models do not provide it. Instead, the results of (Zhang, Isola, Efros, Shechtman, & Wang, 2018), which highlight the effectiveness of convolutional architectures at producing interpretable features for judging image similarity, inspire the use of typical residual-connection CNNs for this task. (Kolesnikov, Zhai, & Beyer, 2019) find that the pre-logit layer from classification models works well with residual-connection networks. In that case they find very good results with a slightly modified residual network using fully invertible connections inspired by (Dinh, Sohl-Dickstein, & Bengio, 2017). Although the models explored in this work use only the more common residual connections, this is noted since it may be a good avenue to explore in future work or in other applications. The key to training the feature extraction network then lies in the target signal. In this case no labels or annotated data exist, but there does exist a quality estimate in the form of an LL from the density model, which has already been shown to correlate with features that are human-interpretable as quality issues. Since this is a regression problem rather than classification, the pre-linear layer is used, which occupies a position in the layer hierarchy similar to that of the pre-logits layer. To train the model, random crops were taken from the images as they were during the training procedure for the generative model, but in this case a pre-trained generative model estimates the LL of each crop, which is used as the target for the CNN model.
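The way regression targets are generated from a frozen density model can be sketched as follows. Everything named here is a stand-in chosen for illustration: `DiagonalGaussianDensity` replaces the pre-trained flow, the 32-pixel crop side is a hypothetical value (the larger window size is not stated here), and random arrays replace the downsampled grain images.

```python
import numpy as np

class DiagonalGaussianDensity:
    """Stand-in for a pre-trained flow: independent Gaussian per pixel."""

    def fit(self, flat_crops):
        self.mean = flat_crops.mean(axis=0)
        self.var = flat_crops.var(axis=0) + 1e-6   # avoid division by zero
        return self

    def log_likelihood(self, flat_crops):
        z = (flat_crops - self.mean) ** 2 / self.var
        return -0.5 * np.sum(z + np.log(2.0 * np.pi * self.var), axis=1)

def make_regression_batch(images, crop_side, batch_size, density, rng):
    """Random crops paired with the LL the frozen density model assigns."""
    crops = []
    for _ in range(batch_size):
        img = images[rng.integers(0, len(images))]
        top = rng.integers(0, img.shape[0] - crop_side + 1)
        left = rng.integers(0, img.shape[1] - crop_side + 1)
        crops.append(img[top:top + crop_side, left:left + crop_side])
    crops = np.stack(crops)
    targets = density.log_likelihood(crops.reshape(batch_size, -1))
    return crops, targets

rng = np.random.default_rng(0)
# Toy images at 25% of 460x640 (illustration only).
images = [rng.random((115, 160, 3)) for _ in range(4)]
# "Pre-train" the stand-in density on random flattened crops.
density = DiagonalGaussianDensity().fit(rng.random((512, 32 * 32 * 3)))

crops, ll_targets = make_regression_batch(images, 32, 16, density, rng)
```

The density model is never updated in this stage; it only labels each fresh crop, so the regression network sees an effectively unlimited stream of (crop, LL) pairs.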
Figure 5 shows the relative window/crop size used, which is considerably larger than the one in Figure 1. Larger window sizes were used at this stage primarily so that the visualization in Figure 6 would be clearer, whereas the quality estimator earlier aimed at producing a high-resolution heatmap of quality over the image. However, there was a high correlation between the median LL per image produced by the smaller and larger windows, which again underscores the robustness and generality of the overall strategy/approach.
Figure 5: The random crop location from the full-size image is shown in (a) by a green box. The same random crop is shown in (b). Note the full-resolution image is shown here for clarity, but the actual images used in training and evaluation were downsampled to 25% of the original size.
The CNN architecture was inspired mostly by the ResNet V2 architecture (He, Zhang, Ren, & Sun, 2016). The ultimate goal in this part was to obtain a kind of disentangled representation in terms of high-level features that relate to quality, and to that end it worked very well, as shown in Figure 6. The key was restricting the number of activations in the pre-linear layer to three units for this particular application, which is part of the weak supervision required. Using more units caused the features to be spread across more units and made them less interpretable. All three features from that layer have a direct meaning, corresponding to yellowness, particle size, and whiteness, which are proxies for quality factors like broken pieces, immature kernels and cob pieces, and leaf trash or rotten kernels. When this is combined with the likelihood signal, it is possible to discern when a quality issue exists and then classify it; the likelihood is needed because in some cases the extremes of these features were still healthy kernels (e.g. large, dark kernels). In Figure 6, blue points are low LL (low quality) and red points are high LL (high quality).
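The role of the three-unit pre-linear layer can be illustrated with a deliberately small model. The sketch below swaps the residual CNN for a plain two-layer perceptron written in NumPy, and uses a synthetic target in place of the scaled LL; the layer sizes, learning rate, and step count are arbitrary choices for illustration. What it shares with the approach described above is the structure: a narrow bottleneck followed by a single linear output trained by regression.

```python
import numpy as np

def init_params(n_in, n_hidden, n_features, rng):
    """Small MLP standing in for the CNN; n_features is the bottleneck."""
    return {
        "w1": rng.normal(0, 1 / np.sqrt(n_in), (n_in, n_hidden)),
        "w2": rng.normal(0, 1 / np.sqrt(n_hidden), (n_hidden, n_features)),
        "w3": rng.normal(0, 1 / np.sqrt(n_features), (n_features, 1)),
    }

def forward(params, x):
    hidden = np.tanh(x @ params["w1"])
    features = np.tanh(hidden @ params["w2"])   # pre-linear layer (3 units)
    prediction = (features @ params["w3"])[:, 0]
    return hidden, features, prediction

def train_step(params, x, y, lr):
    """One full-batch gradient-descent step on mean squared error."""
    hidden, features, pred = forward(params, x)
    err = (pred - y)[:, None] * (2.0 / len(y))
    g3 = features.T @ err
    d_feat = (err @ params["w3"].T) * (1 - features ** 2)
    g2 = hidden.T @ d_feat
    d_hid = (d_feat @ params["w2"].T) * (1 - hidden ** 2)
    g1 = x.T @ d_hid
    for name, grad in (("w1", g1), ("w2", g2), ("w3", g3)):
        params[name] -= lr * grad
    return float(np.mean((pred - y) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 20))
# Synthetic stand-in for the scaled LL target (illustration only).
y = np.tanh(x[:, 0]) - 0.5 * np.tanh(x[:, 1])

params = init_params(n_in=20, n_hidden=16, n_features=3, rng=rng)
losses = [train_step(params, x, y, lr=0.05) for _ in range(500)]

# After training, the bottleneck activations serve as the feature space.
_, features, _ = forward(params, x)
```

Plotting the three columns of `features` against each other, colored by the target, gives the kind of scatterplot described for Figure 6; widening the bottleneck only requires changing `n_features`.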
Figure 6: A scatterplot of the pre-linear layer feature space of the CNN, colored by the LL score (scaled for visualization). Example crops from certain areas highlighted show the clustering of images based on human-interpretable factors, and the overall disentangled representation achieved in this feature space. Note that data points identified as high quality/LL (red) in the leaf/bark and rotten kernels cluster are just very large dark-colored kernels (as would be expected).
Code for training a feature extractor using a pre-trained MAF density model is available at https://github.com/johnpjust/GQC_featureExtraction.
With implications ranging from the food and drug industry to medical instrumentation, military, and agricultural applications, it is shown that a fully label-free (unsupervised) approach utilizing artificial intelligence algorithms to estimate novelty and/or perform classification with high-dimensional data is not only feasible, but can be highly effective in cases where obtaining annotated data would be quite impractical. The methods detailed in this work are not limited to the example shown, and could readily be extended to achieve cutting-edge results in applications such as heart or seizure monitoring devices, detecting food and medicine quality or counterfeit spices with reflectance spectroscopy, or disease monitoring of crops from aerial imagery. In the example presented in this work, the semantic and granular (both spatially and on a continuous scale) labeling of the quality of grain in images was performed in an unsupervised fashion with normalizing flow deep generative models. This was done by overlaying a heatmap of the scaled log-likelihood spatially on the images, and by using point values for each image to sort the images by overall quality. Furthermore, it is shown that training a feature-extracting convolutional neural network with the output (log-likelihood) of a pre-trained deep generative model results in a disentangled representation in the pre-linear layer, ultimately providing a highly effective unsupervised (or at most weakly supervised) means for disentangled representation learning.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPUs used for this research.
Choi, H., Jang, E. and Alemi, A.A. (2019) 'WAIC, but Why? Generative Ensembles for Robust Anomaly Detection', arXiv:1810.01392.
De Cao, N., Aziz, W. and Titov, I. (2019) 'Block Neural Autoregressive Flow', Uncertainty in Artificial Intelligence (UAI), Tel Aviv, Israel.
Dinh, L., Krueger, D. and Bengio, Y. (2014) 'NICE: Non-linear Independent Components Estimation', arXiv:1410.8516.
Dinh, L., Sohl-Dickstein, J. and Bengio, S. (2017) 'Density Estimation Using Real NVP', International Conference on Learning Representations.
Dua, D. and Taniskidou, K.E. (2017) UCI machine learning repository.
Germain, M., Gregor, K., Murray, I. and Larochelle, H. (2015) 'MADE: Masked Autoencoder for Distribution Estimation', International Conference on Machine Learning, 881-889.
Hendrycks, D., Mazeika, M. and Dietterich, T. (2019) 'Deep Anomaly Detection with Outlier Exposure', International Conference on Learning Representations, New Orleans, LA.
He, K., Zhang, X., Ren, S. and Sun, J. (2016) 'Identity Mappings in Deep Residual Networks', European Conference on Computer Vision, 630-645.
Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S. and Lerchner, A. (2017) 'beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework', International Conference on Learning Representations.
Hyvarinen, A., Karhunen, J. and Oja, E. (2001) Independent Component Analysis.
Just, J. and Ghosal, S. (2019) 'Deep Generative Models Strike Back! Resolving Unmet Expectations', arXiv:1911.04699.
Kolesnikov, A., Zhai, X. and Beyer, L. (2019) 'Revisiting Self-Supervised Visual Representation Learning', Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, 1920-1929.
Lake, B.M., Ullman, T.D., Tenenbaum, J.B. and Gershman, S.J. (2016) 'Building Machines That Learn and Think Like People', Behavioral and Brain Sciences.
Locatello, F., Bauer, S., Lucic, M., Ratsch, G., Gelly, S., Scholkopf, B. and Bachem, O. (2019) 'Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations', International Conference on Machine Learning, Long Beach, CA.
Nalisnick, E., Matsukawa, A., Teh, Y.W., Gorur, D. and Lakshminarayanan, B. (2019) 'Do Deep Generative Models Know What They Don’t Know?', International Conference on Learning Representations, New Orleans, LA.
Papamakarios, G., Pavlakou, T. and Murray, I. (2017) 'Masked Autoregressive Flow for Density Estimation', Advances in Neural Information Processing Systems.
Salimans, T., Karpathy, A., Chen, X. and Kingma, D. (2017) 'PixelCNN++: A PixelCNN Implementation with Discretized Logistic Mixture Likelihood and Other Modifications', International Conference on Learning Representations.
Shafaei, A., Schmidt, M. and Little, J.J. (2019) 'A Less Biased Evaluation of Out-of-distribution Sample Detectors', arXiv:1809.04729.
Zhang, R., Isola, P., Efros, A.A., Shechtman, E. and Wang, O. (2018) 'The Unreasonable Effectiveness of Deep Features as a Perceptual Metric', Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.