A standard image sensor comprises an array of Active Pixel Sensors (APS). Each APS circuit reports the pixel intensity of the image formed at the focal plane by cycling between an integration period (in which photons are collected and counted by each pixel detector) and a readout period (in which digital counts from all pixels are combined to form a single frame). Motion detected and estimated across frames has useful applications in computer vision tasks. Unfortunately, detecting fast-moving objects can be challenging due to the limitations of the integration and readout circuit. Object motion that is too fast relative to the integration period induces blurring and other artifacts. Additionally, since all pixels share a single exposure setting, parts of the scene may be underexposed while other parts are saturated. Both of these issues degrade the image quality of the captured video frames, reducing our ability to detect or recognize objects by their shapes or their motions. While high-speed cameras with very fast frame rates can resolve blur issues, they are expensive, consume a lot of power, generate large amounts of data, and require adjusting exposure settings.
Event-based cameras were engineered to overcome these limitations of the APS circuitry found on conventional framing cameras. As described below, these neuromorphically inspired cameras can operate at extremely high temporal resolution (>800 kHz), low latency (20 microseconds), wide dynamic range (>120 dB), and low power (30 mW). They report only changes in the pixel intensity, requiring a new set of techniques to perform basic image processing and computer vision tasks; examples include optical flow [3,8], feature extraction [4,12,13], gesture recognition [2,11], and object recognition [5,14].
The “time-surface” is one such technique; it encodes event time as an intensity and has proven useful in pattern recognition [10]. However, time-surfaces are sensitive to noise and, when intensity changes are large, to multiple events that correspond to the same image edge but arrive with some latency. Both affect time-surfaces much as blurring affects APS data. An improved time-surface technique called Filtered Surface of Active Events (FSAE) [1] was introduced as part of a corner detection and tracking algorithm. FSAE yields an improved time-surface by using only the initial event of a series, effectively removing later events corresponding to the same edge. Yet, while FSAE is shown to be very effective for representing simple features such as corners, object classification tasks deal with significantly more complex objects.
In this work, we propose Inceptive Event Time-Surfaces (IETS), aimed at extracting noise-robust, low-latency features that correspond to complex object edge contours over a temporal window. IETS extends FSAE to achieve higher object recognition accuracy while removing over 70% of FSAE events. We verify the effectiveness of our object classification framework on multiple datasets.
1.1 Event Cameras
Each event-based camera pixel operates asynchronously with no notion of frame rate across the focal plane. Instead of using a fixed integration time, pixels generate events only when the rate of detected photons rises above or falls below a predefined threshold. A log-based threshold gives the event camera an extreme dynamic range. If the scene is changing slowly, the sensor naturally compresses the data since few events are generated. In contrast, fast-moving objects trigger events almost instantaneously, allowing object tracking within microseconds. Example event generation for a single pixel is illustrated in Fig. 1(a).
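The level-crossing behavior described above can be illustrated with a minimal sketch for a single idealized pixel. It assumes no noise, no refractory period, and no low-pass filtering; the function name `generate_events` and the parameter `contrast_threshold` are hypothetical and are not taken from any sensor specification.

```python
import math

def generate_events(times_us, intensities, contrast_threshold=0.2):
    """Emit (time, polarity) events each time log-intensity crosses a level.

    Idealized single-pixel model (an assumption for illustration only):
    no noise, no refractory period, no analog filtering.
    """
    events = []
    reference = math.log(intensities[0])  # log-intensity at the last event
    for t, intensity in zip(times_us[1:], intensities[1:]):
        delta = math.log(intensity) - reference
        # A large change can cross several levels, producing a burst of events.
        while abs(delta) >= contrast_threshold:
            polarity = 1 if delta > 0 else -1
            events.append((t, polarity))
            reference += polarity * contrast_threshold
            delta = math.log(intensity) - reference
    return events

# Doubling the intensity changes log-intensity by about 0.69,
# which crosses three levels of width 0.2.
events = generate_events([0, 10, 20], [100.0, 200.0, 200.0])
print(events)  # [(10, 1), (10, 1), (10, 1)]
```

In this idealized model all three events share one timestamp; in a real sensor they would be spread out in time, which is the latency effect discussed later in the paper.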
In a Prophesee Asynchronous Time-based Image Sensor, used in N-CARS [17], each event comprises a row, column, time, and polarity. Row and column are the pixel coordinates. The time entry records when the change was detected in microseconds, and the polarity is a binary value indicating if the intensity increased or decreased.
Event camera data is often noisy and requires filtering for many applications. Previous algorithms rely on the assumptions that when a pixel is triggered, neighboring pixels are also activated [7,15], and that large intensity changes generate multiple events at a single pixel. These assumptions motivate the use of spatial-temporal density as a way to isolate valid events from noise, but this approach fails when motion is slow (i.e., sparse valid events are removed as noise) and when noise is high (i.e., dense noise is mislabeled as real events).
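The density assumption can be made concrete with a generic nearest-neighbor filter: an event is kept only if some pixel in its 3×3 neighborhood (including the pixel itself) fired within a recent time window. This is a sketch of the general idea, not the specific algorithms of [7,15]; the name `density_filter` and the default `window_us` are assumptions.

```python
def density_filter(events, window_us=5000):
    """Keep events supported by a recent event in the 3x3 neighborhood.

    events: iterable of (x, y, p, t) tuples sorted by time t (microseconds).
    window_us is a hypothetical support window, chosen for illustration.
    """
    last_time = {}  # (x, y) -> time of the most recent event at that pixel
    kept = []
    for x, y, p, t in events:
        supported = any(
            t - last_time.get((x + dx, y + dy), float("-inf")) <= window_us
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
        )
        if supported:
            kept.append((x, y, p, t))
        last_time[(x, y)] = t  # record the event even if it was rejected
    return kept

events = [
    (5, 5, 1, 0),       # first event of an edge: no prior support
    (6, 5, 1, 1000),    # neighbor fired 1 ms earlier: kept
    (40, 40, 1, 2000),  # isolated event: rejected
    (5, 6, -1, 90000),  # valid but slow follow-up: rejected as well
]
print(density_filter(events))  # [(6, 5, 1, 1000)]
```

The last event shows the slow-motion failure mode described above: it lies next to earlier activity, but the gap exceeds the window, so it is discarded as if it were noise.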
Fig. 1. Event Generation. (a) On a per pixel level, intensity variations trigger events at each log-scaled level crossing. The first event in a series of consecutive events is called an Inceptive Event. (b) Time-Surface generation in the presence of noise.
1.2 N-CARS Dataset
The N-CARS dataset is a large, real-world, event-based, public dataset for car classification. It is composed of 12,336 car samples and 11,693 non-car samples (background). The camera was mounted behind the windshield of a car and gives a view similar to what the driver would see. Each sample contains exactly 100 milliseconds of data with 500 to 59,249 events per sample.
Fig. 2 shows a sequence from N-CARS; each point in the three dimensional cube (2D space, 1D time) represents a reported event. Object velocity can be inferred when this cube is viewed from the time-space plane (Fig. 2(a)), while the object shape is better identifiable from the 2D space plane (Fig. 2(b)). Spiral patterns near the rear wheel of the car highlight high-speed rotational motion—a challenging set of relevant features to preserve during dimensionality reduction.
Fig. 2. N-CARS Dataset Example. (a) 3D plot of event data colored by time (blue/old to green/new). (b) Same data viewed under a different orientation.

Object classification from event data is an active area of research. A number of applications require feature extraction from the raw event camera data in order to carry out classification tasks. The time-surface is a technique used as an intermediate step to feature extraction, reducing the spatial-temporal structure in Fig. 2 to a two-dimensional image representation. More specifically, let E denote the set of events generated by an event camera sensor of frame size $N \times M$:

$$E = \{(x_i, y_i, p_i, t_i)\}_i \tag{1}$$

where $x_i \in [1, ..., N]$ and $y_i \in [1, ..., M]$ represent the pixel coordinates in the frame; $p_i \in \{-1, +1\}$ is the event polarity; and $t_i$ is the time of the event in microseconds. Additionally, let T, an ordered set of event times for a single pixel (x, y) with polarity p, be defined as:

$$T(x, y, p) = \{\, t_i \mid (x_i, y_i, p_i) = (x, y, p) \,\} \tag{2}$$

Then the time-surface for each pixel (x, y) with polarity p is defined as [10]:

$$S(x, y, p) = \operatorname{mean}\big(T(x, y, p)\big) \tag{3}$$
Variations to time-surface can be implemented by replacing the “mean” operator in (3) with minimum, maximum, median, etc.
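The per-pixel grouping of event times and the reduction to a single value per pixel and polarity can be sketched as follows. The function names and the sparse dictionary representation are assumptions made for illustration; the reducing operator is passed in so that the mean can be swapped for minimum, maximum, or median.

```python
from collections import defaultdict
from statistics import mean, median

def event_times(events):
    """Group event times by (x, y, p), each list in ascending time order."""
    times = defaultdict(list)
    for x, y, p, t in sorted(events, key=lambda e: e[3]):
        times[(x, y, p)].append(t)
    return dict(times)

def time_surface(events, reduce=mean):
    """Map each (x, y, p) that has at least one event to a reduced time."""
    return {key: reduce(ts) for key, ts in event_times(events).items()}

# Events are (x, y, polarity, time in microseconds).
events = [(1, 1, 1, 100), (1, 1, 1, 300), (1, 1, -1, 250), (2, 1, 1, 400)]
surface_mean = time_surface(events)              # mean operator
surface_last = time_surface(events, reduce=max)  # most recent event instead
print(surface_mean[(1, 1, 1)], surface_last[(1, 1, 1)])  # 200 300
```

Pixels without events are simply absent from the dictionary here; a dense image would need a fill value for them.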
Time-surface has been used successfully in object recognition tasks. For example, Hierarchy of Time-Surfaces (HOTS) [10] utilized straightforward time-surfaces for feature generation, but it did not attempt to limit the impact of noise directly, instead relying on clustering. While this method performed well on simple shapes like numbers and letters, it does not extend well to more complex-shaped objects with wider variations (like cars).
The Histograms of Averaged Time-Surfaces (HATS) algorithm [17] localizes the motion vector representation for a specific region of the sensor (cell) using a region-based time-surface. This improved robustness to noise by averaging across the reported times of the events within each cell. A major disadvantage to HATS is the loss of fine spatial features, which is exacerbated by the low sensor resolution of current event cameras.
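FSAE is a method to directly improve the time-surface by eliminating redundant events [1]. The FSAE filter is defined as:

$$\mathrm{FSAE}(x, y, p) = \{\, t_k \in T(x, y, p) \mid t_k - t_{k-1} > \tau^- \,\} \tag{4}$$

where $t_{k-1}$ is the event time immediately preceding $t_k$ in T(x, y, p) and $\tau^-$ is a pre-defined threshold. Intuitively, events occurring in succession typically correspond to the same edge, and so redundant events can be eliminated by discarding events that are not temporally separated from prior events.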
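A minimal sketch of this filtering rule for a single pixel and polarity is given below: an event time is kept only when it is separated from the preceding event by more than a threshold. The name `fsae_filter`, the parameter `tau`, and the choice to keep the very first event (which has no predecessor) are assumptions, not details taken from [1].

```python
def fsae_filter(times, tau):
    """Keep event times separated from the previous event by more than tau.

    times: ascending event times (microseconds) for one pixel and polarity.
    The first event has no predecessor and is kept (an assumption).
    """
    kept = []
    previous = None
    for t in times:
        if previous is None or t - previous > tau:
            kept.append(t)
        previous = t  # always compare against the latest event, kept or not
    return kept

# Two bursts and one isolated event; only the start of each group survives.
times = [1000, 1200, 1500, 40000, 40300, 90000]
fsae = fsae_filter(times, tau=5000)
print(fsae)  # [1000, 40000, 90000]
```

Note that the isolated event at 90000 passes this filter even though nothing follows it, which is relevant to the stricter criterion introduced next in the paper.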
To advance object classification using event data, we propose a novel concept called Inceptive Event Time-Surfaces (IETS). IETS is an extension of FSAE aimed at improving dimensionality reduction and noise robustness. IETS retains features critical to object classification (i.e., corners and edges) by fitting time-surfaces to a subset of events. Unlike previous approaches that focused on generating handcrafted features from noisy event data, IETS uses deep convolutional neural networks (CNNs) to learn features from time-surface images with less noise. As demonstrated by the experiments on the N-CARS dataset, IETS combined with CNNs achieves new state-of-the-art classification performance.
We begin with the observation that a single log-intensity change often triggers multiple events in temporal sequence. As shown in Fig. 1(a), the first event indicates the “arrival” of an edge, which we refer to as an “inceptive event” (IE). Intuitively, IEs describe the shape of the moving object within the scene. The subsequent events, on the other hand, correspond to the magnitude of the log-intensity change; we refer to them as “scaling events.” As such, the edge magnitude indicated by successive scaling events does not necessarily describe the edge shape well. The comparison between inceptive and scaling events in Fig. 1(a) makes this clear. While scaling events are more useful for intensity-based inferences, the effect that their latency (relative to the edge arrival) has on the time-surface is similar to image blur. Furthermore, scaling events are subject to degradation by two hardware design features: a low-pass filter and a regulated “refractory period,” a period of time after an event trigger that a pixel must wait before triggering again (due to the limitations of the readout and reset circuits).
Object detection tasks require a clear representation of the object boundaries that define the shape of the object-of-interest. Recall (2). To successfully filter events prior to time-surface generation, we propose the following:
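$$\mathrm{IE}(x, y, p) = \{\, t_k \in T(x, y, p) \mid t_k - t_{k-1} > \tau^- \ \wedge\ t_{k+1} - t_k < \tau^+ \,\} \tag{5}$$

where $\tau^-$ and $\tau^+$ are predefined threshold parameters. One may notice by comparing (5) to (4) that $\mathrm{IE}(x, y, p) \subseteq \mathrm{FSAE}(x, y, p)$, meaning that IEs can never outnumber FSAE events. The proposed IETS is then defined as a time-surface constructed from IE:

$$\mathrm{IETS}(x, y, p) = \operatorname{mean}\big(\mathrm{IE}(x, y, p)\big) \tag{6}$$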
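The sketch below shows one plausible reading of the inceptive-event criterion for a single pixel and polarity: an event is inceptive if it follows a quiet gap longer than one threshold and is itself followed by another event within a second threshold. The names `tau_minus` and `tau_plus`, the treatment of the first event as having a quiet gap before it, and the use of the mean for the surface value are all assumptions made for illustration.

```python
from statistics import mean

def inceptive_events(times, tau_minus, tau_plus):
    """Return event times that start a series of closely spaced events.

    times: ascending event times (microseconds) for one pixel and polarity.
    """
    kept = []
    for i, t in enumerate(times):
        # Assumption: the first event counts as preceded by a quiet gap.
        quiet_before = i == 0 or t - times[i - 1] > tau_minus
        followed = i + 1 < len(times) and times[i + 1] - t < tau_plus
        if quiet_before and followed:
            kept.append(t)
    return kept

def iets_value(times, tau_minus, tau_plus):
    """Time-surface value built from inceptive events, or None if there are none."""
    ie = inceptive_events(times, tau_minus, tau_plus)
    return mean(ie) if ie else None

times = [1000, 1200, 1500, 40000, 40300, 90000]
ie = inceptive_events(times, tau_minus=5000, tau_plus=5000)
print(ie)                            # [1000, 40000]
print(iets_value(times, 5000, 5000)) # 20500
```

The isolated event at 90000 would survive a filter based only on the gap to the previous event, but it is rejected here because no scaling event follows it, so the inceptive events form a subset of the gap-filtered events.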
We propose to carry out object classification by training a CNN on IETS surfaces. There are three input image channels to the proposed CNN. The first two input channels are IETS surfaces of both polarities, IETS(x, y, +1) and IETS(x, y, −1), which are mapped to images of 8-bit intensity values. The third input channel is generated from a simple count of unfiltered events (i.e., E(x, y)) at each pixel. This channel can improve machine learning by acting as a weight for the other channels. All channels are scaled from 0 to 1, and pixels with no events in the entire dataset are set to zero. With the selected threshold parameters, IETS removes over 85% of events in N-CARS. Discriminating noise from real events can be challenging, degrading time-surfaces significantly. Fig. 1(b) highlights the effectiveness of IETS in removing noise while accurately fitting the time-surface, compared to other methods.

Fig. 3. Time-Surface Visualization. (a) Noisy 2D time-surface (bottom) compiled from events represented as a 3D mesh (above). (b) Same visualization constructed from the subset of FSAE events. (c) Same visualization constructed from the subset of IETS events. IETS shows significantly less noise in the time-surface, representing meaningful image features better than the unfiltered sensor events or FSAE events.
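Assembling the three input channels can be sketched as below. The sketch assumes that surface times are measured from the start of the sample window and are scaled by the window length, and that the count channel is scaled by the largest per-pixel count; the paper states only that channels are scaled from 0 to 1, so these scaling choices and all names are assumptions.

```python
def build_image(surface_pos, surface_neg, counts, width, height, window_us):
    """Return a height x width grid of (pos, neg, count) triples in [0, 1].

    surface_pos / surface_neg: dict (x, y) -> surface time in [0, window_us].
    counts: dict (x, y) -> number of unfiltered events at that pixel.
    Pixels absent from a dict are treated as having no events and set to 0.
    Caveat of this scaling: an event exactly at the window start also maps to 0.
    """
    max_count = max(counts.values(), default=0)
    image = []
    for y in range(height):
        row = []
        for x in range(width):
            pos = surface_pos.get((x, y), 0) / window_us
            neg = surface_neg.get((x, y), 0) / window_us
            cnt = counts.get((x, y), 0) / max_count if max_count else 0.0
            row.append((pos, neg, cnt))
        image.append(row)
    return image

image = build_image(
    surface_pos={(0, 0): 50_000},
    surface_neg={(1, 0): 100_000},
    counts={(0, 0): 4, (1, 0): 2},
    width=2, height=1, window_us=100_000,
)
print(image)  # [[(0.5, 0.0, 1.0), (0.0, 1.0, 0.5)]]
```

An 8-bit image can be obtained from these values with `round(255 * v)` per channel before passing the result to an image-based CNN pipeline.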
Due to the extremely sparse number of events (< 1k) in some N-CARS datasets, likely captured during periods of little camera or target motion, IETS filtering occasionally makes object identification even more challenging. For that reason, if a pixel does not contain an IE, the mean time of all events for that pixel is used in its place. Although this reintroduces noise to each image, the overall classification accuracy on N-CARS improved by over 12% when mean event time for non-IE data was appended. Additional data, even if very noisy, is preferred when using deep neural networks. Fig. 3 highlights how effectively IETS can reduce dimensionality while at the same time removing noisy events.
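The fallback described above amounts to a per-pixel choice between two averages. A minimal sketch follows, assuming the inceptive event times for the pixel have already been computed; the function name is hypothetical.

```python
from statistics import mean

def surface_with_fallback(times, ie_times):
    """Surface value for one pixel, falling back to the mean of all events.

    times: all event times at the pixel; ie_times: its inceptive event times.
    """
    if ie_times:
        return mean(ie_times)
    if times:
        return mean(times)  # noisier, but keeps sparse samples populated
    return None  # no events at this pixel

print(surface_with_fallback([1000, 1200, 40000], [1000]))  # 1000
print(surface_with_fallback([90000], []))                  # 90000
```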
Previous event-based features [6,10] are limited in the same way as many custom-designed descriptors. Leveraging CNNs to learn features is typically superior to using custom-designed features. Of course, deep convolutional neural networks currently require millions of labeled images, something that does not yet exist for event cameras. Since no vast archive of labeled event camera data exists, IETS images are generated in a way that suits transfer learning of features from millions of real-world images via GoogLeNet [9,18]. IETS is highly parallelizable, and training is quick since transfer learning converges rapidly. IETS generates images at the full resolution of the event camera. This means that resolution, which is typically poor for event cameras, is not lost prior to classification as it is with algorithms employing cells.
IETS has excellent performance as all events in a given time window are processed simultaneously—removing the requirement to iterate over each event. Additionally, a non-optimal implementation of IETS processed over 100k events/sec, significantly faster than real-time requirements.
Fig. 4. (a) Example input to CNN is two IETSs (positive/negative polarity) and the event count per pixel (shown here as RGB). Examples from N-CARS dataset that were (b) correctly and (c) incorrectly labeled as ‘cars.’
Each N-CARS sample was processed into an image using IETS. Examples from IETS processing are shown in Fig. 4. Algorithm evaluation was accomplished via the standard metrics of accuracy rate and Area Under Curve (AUC).
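For reference, both metrics can be computed directly from per-sample labels and classifier scores. The sketch below uses the pairwise-ranking definition of AUC (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one), which is quadratic in the number of samples and intended only for illustration; the 0.5 decision threshold and the toy data are assumptions.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)

def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0]            # 1 = car, 0 = background
scores = [0.9, 0.8, 0.4, 0.6, 0.1]  # hypothetical classifier outputs
print(accuracy(labels, scores))     # 0.6
print(round(auc(labels, scores), 4))  # 0.8333
```

Unlike accuracy, AUC does not depend on the decision threshold, which is why the two numbers can differ noticeably on the same predictions.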
The maximum score was produced after augmenting the training data with flipped copies of the IETS images. The maximum accuracy obtained by IETS was 0.973, a considerable improvement over the published HATS score of 0.902; AUC also improved from 0.945 to 0.997. A comparison with other state-of-the-art algorithms is shown in Table 1. To ensure that the performance gains did not come entirely from replacing the Support Vector Machine (SVM) with a CNN, HATS features were used to train the same GoogLeNet architecture. These results are also included in Table 1 as HATS/CNN. Additionally, to show the improvement IETS offers in generating a time-surface, FSAE images were used to train the same architecture and are also included for comparison.
Table 1. Classification results on N-CARS.
To further test the results from IETS, an IniVation Davis Dynamic Vision Sensor (DVS) 240C was used to record cars driving near the University of Dayton. This dataset differed significantly from N-CARS: targets were acquired using a camera from a different manufacturer and at a greater range, images were uncropped, and the camera was stationary. The vehicles were recorded side-on, as shown in Fig. 5. Seven datasets were recorded with durations ranging from 2.76 to 8.30 seconds, resulting in 5,236 samples. Using four datasets for training and three for testing resulted in a classification accuracy of 0.9951 and an AUC score of 0.9999. Although the dataset proved less challenging, the results indicate that supplementing with additional variation in sensor models, viewing angles, and camera positions will allow the algorithm to extend to more general use cases.

Fig. 5. Three IETS images generated from data collected near the University of Dayton used for additional testing. Data included multiple cars, buses, and trucks.
Overall, there is a wide range of future applications for event-based sensors due to their speed, size, low memory requirements, and high dynamic range. This paper presents an algorithm that improves state-of-the-art performance for object classification of cars. As classification rates near 100% on N-CARS, the lack of large labeled datasets will limit advancement in this area. Multiple simulators now exist for generating synthetic data [13,16] and have been used successfully in several papers for testing. Although these simulators may be useful in the short term, real-world data is always preferred, as noise, calibration, and manufacturing defects are challenging to simulate reliably.
Two limitations of IETS should be addressed with future work. First, IETS relies on the fact that edges triggering events rarely generate large, overlapping time-surfaces within 100 milliseconds. This may not be true for all scenarios. For example, a spinning fan, pulsing light, or very fast moving object would generate overlapping surfaces and likely limit the utility of IETS in these cases. The IETS algorithm currently averages overlapping surfaces, but this is not optimal as these unique signatures are undetectable to a standard camera. Second, after the time-surfaces are generated from IEs, no effort is made to recover data originally filtered as noise. A two-stage filter design will help recover events and allow for a broader application of the algorithm.
1. Alzugaray, I., Chli, M.: Asynchronous corner detection and tracking for event cameras in real time. IEEE Robotics and Automation Letters 3(4), 3177–3184 (2018)
2. Amir, A., Taba, B., Berg, D., Melano, T., McKinstry, J., Di Nolfo, C., Nayak, T., Andreopoulos, A., Garreau, G., Mendoza, M., et al.: A low power, fully event-based gesture recognition system. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7243–7252 (2017)
3. Bardow, P., Davison, A.J., Leutenegger, S.: Simultaneous optical flow and intensity estimation from an event camera. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 884–892 (2016)
4. Barranco, F., Teo, C.L., Fermüller, C., Aloimonos, Y.: Contour detection and characterization for asynchronous event sensors. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 486–494 (2015)
5. Barua, S., Miyatani, Y., Veeraraghavan, A.: Direct face detection and video reconstruction from event cameras. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1–9. IEEE (2016)
6. Clady, X., Maro, J.M., Barré, S., Benosman, R.B.: A motion-based feature for event-based pattern recognition. Frontiers in Neuroscience 10, 594 (2017)
7. Czech, D., Orchard, G.: Evaluating noise filtering for event-based asynchronous change detection image sensors. In: 2016 6th IEEE International Conference on Biomedical Robotics and Biomechatronics (BioRob). pp. 19–24. IEEE (2016)
8. Haessig, G., Cassidy, A., Alvarez, R., Benosman, R., Orchard, G.: Spiking optical flow for event-based sensors using IBM’s TrueNorth neurosynaptic system. IEEE Transactions on Biomedical Circuits and Systems 12(4), 860–870 (2018)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
10. Lagorce, X., Orchard, G., Galluppi, F., Shi, B.E., Benosman, R.B.: HOTS: a hierarchy of event-based time-surfaces for pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(7), 1346–1359 (2017)
11. Lee, J.H., Delbruck, T., Pfeiffer, M., Park, P.K., Shin, C.W., Ryu, H., Kang, B.C.: Real-time gesture interface based on event-driven processing from stereo silicon retinas. IEEE Transactions on Neural Networks and Learning Systems 25(12), 2250–2263 (2014)
12. Mitrokhin, A., Fermüller, C., Parameshwara, C., Aloimonos, Y.: Event-based moving object detection and tracking. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 1–9. IEEE (2018)
13. Mueggler, E., Rebecq, H., Gallego, G., Delbruck, T., Scaramuzza, D.: The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and SLAM. The International Journal of Robotics Research 36(2), 142–149 (2017)
14. Orchard, G., Meyer, C., Etienne-Cummings, R., Posch, C., Thakor, N., Benosman, R.: HFirst: a temporal approach to object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(10), 2028–2040 (2015)
15. Padala, V., Basu, A., Orchard, G.: A noise filtering algorithm for event-based asynchronous change detection image sensors on TrueNorth and its implementation on TrueNorth. Frontiers in Neuroscience 12, 118 (2018)
16. Rebecq, H., Gehrig, D., Scaramuzza, D.: ESIM: an open event camera simulator. In: Conference on Robot Learning. pp. 969–982 (2018)
17. Sironi, A., Brambilla, M., Bourdis, N., Lagorce, X., Benosman, R.: HATS: Histograms of averaged time surfaces for robust event-based object classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1731–1740 (2018)
18. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–9 (2015)