A major challenge of working with large-scale machine learning datasets is the difficulty of exploratory data analysis [7]. Researchers often want to immerse themselves in the data, but for datasets containing thousands of images and annotations there is no straightforward way to do so. Visualization efforts typically involve writing narrowly targeted programs or scripts to generate plots, such as bar charts of category counts or sunburst diagrams showing the relative proportions of different annotations. However, such visualizations show only aggregate results and thereby hide interesting individual examples. Even for data cleaning and quality control, manually going through each image and its annotations is tedious and prone to human error.
To overcome these challenges, we developed the VizWiz Dataset Browser. The VizWiz dataset originates from people who are blind, who used mobile phones to take photos and record questions about them (e.g., “what type of beverage is in this bottle?” or “has the milk expired?”); it contains images paired with questions about those images [5]. Subsequent research has produced a variety of annotations on top of the VizWiz dataset. These include ten crowdsourced answers to each visual question [6]; reasons explaining why the ten answers differ, when they do [4]; captions describing the images for users with visual impairments; the skills an AI system needs to automatically answer each visual question; quality issues present in the images (which arise because the photographers could not see the photos they were capturing); and whether text is present in the image. As more annotations were collected, we saw the need to view all these kinds of rich data on a single platform, in order to get a holistic view of the information the datasets contain.
The VizWiz Dataset Browser is a single-page web application built on the LAMP stack (Linux, Apache, MariaDB, PHP). It supports searching over textual annotations and filtering by categorical annotations. The main purpose of the tool is to view images and to find them using the ‘meta-data’ provided by the annotations. So that the interface scales with a growing variety of annotations, we placed the search and filter controls on the left side of the screen, in their own independently scrollable section. Because the search and filter options are not laid out horizontally across the top of the page, more of the dynamic content can be displayed above the fold. Popular e-commerce websites, which display numerous filters on their search-results pages, make similar design choices [1, 2, 3].
2.1. Visualization Section
Figure 1 shows a screenshot of the main information visualization area. The image and the textual annotations, namely (a) the question, (b) the ten answers, and (c) the five captions, are displayed in their natural form. The categorical annotations, namely (d) answer-difference reasons, (e) skills, and (f) quality issues, are displayed as one-dimensional heatmaps whose cell shading reflects how many crowdworkers (out of five) selected each categorical label.
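The mapping from vote counts to heatmap cells can be illustrated with a linear color ramp. The sketch below is a minimal illustration only, assuming a white-to-blue ramp; the base color and the names `heatmap_color` and `heatmap_row` are hypothetical and are not taken from the browser's code.

```python
# Hypothetical sketch: shade one heatmap cell per categorical label according
# to how many of the five crowdworkers selected that label.
NUM_WORKERS = 5             # each categorical label is judged by five crowdworkers
BASE_RGB = (31, 119, 180)   # assumed "all five agree" color; not from the paper


def heatmap_color(votes: int, num_workers: int = NUM_WORKERS) -> str:
    """Linearly interpolate from white (0 votes) to BASE_RGB (all votes)."""
    if not 0 <= votes <= num_workers:
        raise ValueError(f"votes must be in [0, {num_workers}], got {votes}")
    t = votes / num_workers
    r, g, b = (round(255 + (c - 255) * t) for c in BASE_RGB)
    return f"#{r:02x}{g:02x}{b:02x}"


def heatmap_row(label_votes: dict[str, int]) -> list[tuple[str, str]]:
    """Return (label, color) pairs, one cell per label, in the dict's insertion order."""
    return [(label, heatmap_color(v)) for label, v in label_votes.items()]
```

For example, `heatmap_row({"DFF": 5, "ROT": 0})` shades the DFF cell fully and leaves the ROT cell white. A linear ramp keeps the six possible vote counts visually distinguishable without a legend.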
2.2. Summary of Results
The top portion of the visualization section summarizes the search results: the total number of images found for the current search and/or filter query, and the range of images shown on the current page. To keep page loading times short, we show at most 50 images per page. By clicking ‘Expand Summary of Images’, users can view thumbnails of all the images on the current page (as shown in Figure 2). Clicking a thumbnail within the ‘Summary of Images’ section takes the user to that image's details section.
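The paging arithmetic behind the summary line can be sketched as follows, assuming results are fetched with a SQL `LIMIT ... OFFSET ...` clause; `page_bounds` and `num_pages` are hypothetical helpers, not the browser's actual code.

```python
PAGE_SIZE = 50  # the browser shows at most 50 images per page


def page_bounds(total: int, page: int, page_size: int = PAGE_SIZE) -> tuple[int, int, int]:
    """Return (offset, first, last) for a 1-indexed page.

    `offset` is the 0-based row offset for a `LIMIT page_size OFFSET offset`
    clause; `first` and `last` are the 1-based positions shown to the user.
    A page past the end of the results yields (offset, 0, 0).
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    offset = (page - 1) * page_size
    if offset >= total:
        return offset, 0, 0  # empty page
    first = offset + 1
    last = min(offset + page_size, total)
    return offset, first, last


def num_pages(total: int, page_size: int = PAGE_SIZE) -> int:
    """Ceiling division: number of pages needed to show `total` results."""
    return -(-total // page_size)
```

For example, with a hypothetical total of 437 results, `num_pages(437)` is 9 and `page_bounds(437, 9)` returns `(400, 401, 437)`, so the last page shows images 401 through 437.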
Figure 2: The summary section shows an overview of the different images returned for the search or filter query. Clicking a thumbnail image lets the user view the details of the image, as in Figure 1. This example was obtained by searching for the word “glass” in the question.
2.3. Searching for Images by Textual Annotations
Users can search for words and phrases within the visual question, the ten answers, and the five crowdsourced captions. Full-text search is powered by the MariaDB relational database. Users can also look up an image by its file name. These search capabilities are shown in Figure 3.
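A minimal sketch of how such queries might be built for MariaDB follows. It assumes a hypothetical `images` table whose `question_text`, `answers_text`, and `captions_text` columns each carry a `FULLTEXT` index, and a DB-API driver that uses `%s` placeholders (such as PyMySQL). MariaDB's `MATCH ... AGAINST` syntax is real; the schema and helper names are assumptions.

```python
# Hypothetical schema: table `images` with a FULLTEXT index on each column below.
SEARCH_COLUMNS = {
    "question": "question_text",
    "answers": "answers_text",
    "captions": "captions_text",
}


def build_text_search(field: str, phrase: str, limit: int = 50, offset: int = 0) -> tuple[str, tuple]:
    """Return (sql, params) for a MariaDB natural-language full-text search.

    Column names cannot be bound as query parameters, so `field` is checked
    against a whitelist; the user's phrase is always passed as a parameter.
    """
    if field not in SEARCH_COLUMNS:
        raise ValueError(f"unknown search field: {field!r}")
    column = SEARCH_COLUMNS[field]
    sql = (
        "SELECT image_name FROM images "
        f"WHERE MATCH({column}) AGAINST(%s IN NATURAL LANGUAGE MODE) "
        "LIMIT %s OFFSET %s"
    )
    return sql, (phrase, limit, offset)


def build_filename_search(file_name: str) -> tuple[str, tuple]:
    """Exact lookup of a single image by its file name (hypothetical column)."""
    return "SELECT image_name FROM images WHERE image_name = %s", (file_name,)
```

With a PyMySQL cursor, `cursor.execute(*build_text_search("question", "glass"))` would run a search like the one in Figure 2. Whitelisting the column keeps user input out of the SQL text itself.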
Figure 3: Different ways to search for images using textual annotations. Users can search for words and phrases within the question, the ten answers, and the five captions.
2.4. Filtering Images by Categorical Annotations
Figure 4: Filtering for images using categorical annotations. The screenshot shows the labels for the answer-difference dataset [4].
The visualization tool can be used to filter images based on the different types of categorical annotations available: (a) answer-difference reasons, (b) skills, and (c) quality issues. This functionality is useful for exploring relationships between the different datasets. For example, by selecting DFF (Difficult Question) as an answer-difference reason and ROT (image needs to be rotated) as an image-quality issue, we can view the specific cases where the visual questions are difficult to answer because the images need to be rotated. The filtering capabilities for the answer-difference reasons are shown in Figure 4.
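The combination of filters in the DFF/ROT example can be sketched as a conjunction over the selected labels. This sketch assumes that an image matches when every selected label received at least `min_votes` crowdworker votes; the browser's actual threshold, and whether it combines labels with AND or OR, are assumptions here, as are the dictionary layout and dataset keys.

```python
from typing import Mapping

# Hypothetical layout: dataset name -> label code -> number of crowdworkers (0-5)
Annotations = Mapping[str, Mapping[str, int]]


def matches_filters(ann: Annotations, selected: Mapping[str, set], min_votes: int = 1) -> bool:
    """True if every selected label in every dataset has >= min_votes votes.

    Labels missing from an image's annotations count as zero votes; an empty
    selection matches every image.
    """
    return all(
        ann.get(dataset, {}).get(label, 0) >= min_votes
        for dataset, labels in selected.items()
        for label in labels
    )


def filter_images(images: Mapping[str, Annotations], selected: Mapping[str, set],
                  min_votes: int = 1) -> list[str]:
    """Names of images that satisfy every selected categorical filter."""
    return [name for name, ann in images.items() if matches_filters(ann, selected, min_votes)]
```

Selecting `{"answer_difference": {"DFF"}, "quality": {"ROT"}}` then keeps only images carrying both labels, which is the intersection the example above describes.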
2.5. Ordering of Search Results
Figure 5: Various options for ordering the search results.
The search results can be ordered (sorted) using the options shown in Figure 5. When searching textual annotations (words or phrases in the question, answers, or captions), the results are sorted in decreasing order of the number of matched words in the annotation. ‘Diversity of answers’ orders the results by how different the ten answers are, as measured by the Shannon entropy of the ten answers. For categorical annotations (answer-difference reasons, skills, quality issues, and text presence), the results are ranked by how many crowdworkers (out of five) assigned the chosen categorical labels to each image.
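Two of these ordering keys can be sketched directly. The entropy function computes H = Σ p_i log2(1/p_i) over the distribution of distinct answers, where p_i is the fraction of the ten answers equal to the i-th distinct answer. The answer normalization (strip and lower-case), the `\w+` tokenization for matched words, and all function names are assumptions for illustration, not the browser's implementation.

```python
import math
import re
from collections import Counter


def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy, in bits, of the distribution of distinct answers.

    Answers are stripped and lower-cased before counting (an assumed
    normalization). Ten identical answers give 0.0; more disagreement
    gives a higher value.
    """
    if not answers:
        return 0.0
    counts = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    # H = sum_i p_i * log2(1 / p_i), with p_i = c_i / n
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def matched_word_count(text: str, query: str) -> int:
    """Number of distinct query words that also occur in the annotation text."""
    text_words = set(re.findall(r"\w+", text.lower()))
    query_words = set(re.findall(r"\w+", query.lower()))
    return len(query_words & text_words)


def order_by_diversity(records: list[dict]) -> list[dict]:
    """Most diverse answer sets first (the 'Diversity of answers' option)."""
    return sorted(records, key=lambda r: answer_entropy(r["answers"]), reverse=True)


def order_by_matches(records: list[dict], field: str, query: str) -> list[dict]:
    """Records whose `field` matches more query words come first."""
    return sorted(records, key=lambda r: matched_word_count(r[field], query), reverse=True)
```

An even five-five split between two answers yields exactly 1 bit, while ten distinct answers yield log2(10), about 3.32 bits, the maximum for ten answers. Because Python's `sorted` is stable, ties keep their original order.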
2.6. Toggling Display of Annotations
Figure 6: Options to hide or show different datasets.
Viewing all the different annotations at once can be overwhelming. Often, the user may want to selectively view certain annotations (e.g., for taking screenshots). For this purpose, the ‘View’ section, as shown in Figure 6, can be used to hide or show the different datasets as desired.
In summary, the VizWiz Dataset Browser can be a useful tool for searching, filtering, and visualizing multiple large datasets. It is already being used to aid a variety of ongoing research efforts in computer vision, accessibility, and human-computer interaction. We hope that future researchers who work with the VizWiz dataset will find the tool helpful for answering interesting research questions.
We thank the crowdworkers for providing the annotations. We thank Kenneth R. Fleischmann, Meredith Morris, Ed Cutrell, and Abigale Stangl for their valuable feedback about this tool and paper. This work is supported in part by funding from the National Science Foundation (IIS-1755593) and Microsoft.
[1] Amazon.com: Online Shopping. https://www.amazon.com. [Online; accessed 11-Dec-2019].
[2] eBay Inc. https://www.ebay.com. [Online; accessed 11-Dec-2019].
[3] Walmart Inc. https://www.walmart.com. [Online; accessed 11-Dec-2019].
[4] N. Bhattacharya, Q. Li, and D. Gurari. Why does a visual question have different answers? In Proceedings of the IEEE International Conference on Computer Vision, pages 4271–4280, 2019.
[5] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, and S. White. VizWiz: Nearly real-time answers to visual questions. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, pages 333–342. ACM, 2010.
[6] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham. VizWiz Grand Challenge: Answering Visual Questions from Blind People. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3617, 2018.
[7] J. W. Tukey. Exploratory Data Analysis. Addison-Wesley Publishing Company, 1977.