Computer assisted interventions (CAI) have the potential to enhance surgeons’ capabilities through better clinical information fusion, navigation and visualization [1]. Currently, CAI systems are used mainly as tools for preoperative planning [2] and for translating such plans into the procedure through surgical navigation [3], [4]. There is potential to develop CAI further with improved navigation capabilities, better imaging and robotic instrumentation [5]. More advanced CAI systems depend on effective use of the video signal, on which surgeons are expected to rely. Data-driven machine learning techniques, and deep learning in particular, have been immensely influential in recent computer vision advances as well as in medical image computing and analysis. Therefore, to exploit such advances for CAI, it is necessary to make use of surgical cameras and to establish data repositories and labels that facilitate the training and subsequent benchmarking of vision models [1], [6].
Over the past decade, the emergence of surgical video datasets has contributed significantly to the fast progress of computer vision-based CAI systems. Notable examples include the Cholec80, Cholec120 [7], RMIT [8] and the EndoVis challenge datasets. In particular, two robotic instrument segmentation datasets have been released for the 2017 [9] and 2019 [10] Robotic Instrument Segmentation EndoVis sub-challenges that included segmentation masks for robotic instruments appearing in the scene. However, background entities such as the anatomy and non-surgical objects are not annotated in these datasets. The 2017 Robotic Instrument Segmentation dataset was later extended for the 2018 Robotic Scene Segmentation EndoVis sub-challenge to include pixel-wise labels for the whole scene for approximately 2400 endoscopic images from robotic nephrectomy procedures [11]. While releasing these datasets, the research community has also worked towards standardizing the reporting of datasets and challenges [12]. Pixel-level annotations could facilitate more advanced and effective applications in CAI: to help image guided interventions [13], [14], support pre-operative surgical planning [15], estimate instrument usage or motion for post-operative analytics [16]–[18], automate diagnostic readouts [19], [20], or enhance surgical training [21]. While data availability is growing through the increasing use of digital surgical cameras in endoscopy, laparoscopy and microsurgery, and through well-established systems for managing confidentiality, regulation and ethics, annotation and data labelling remain a major challenge for CAI.
Fig. 1: Example image frame (left) and semantic segmentation labels (right) from the Cataract dataset for Image Segmentation presented in this paper. (Colormap: Pupil, Iris, Secondary Knife Handle)
This work was supported by the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS) at UCL (203145Z/16/Z), EPSRC (EP/P012841/1, EP/P027938/1, EP/R004080/1) and the H2020 FET (GA 863146). Danail Stoyanov is supported by a Royal Academy of Engineering Chair in Emerging Technologies (CiET1819/2/36) and an EPSRC Early Career Research Fellowship (EP/P012841/1).
M. Grammatikopoulou, E. Flouty, A. Kadkhodamohammadi, A. Chow, J. Nehme, I. Luengo and D. Stoyanov are with Digital Surgery LTD, 230 City Road, EC1V 2QY, London, UK (emails: {maria.grammatikopoulou, evangello.flouty, rahim.mohammadi, andre, jean, imanol.luengo, danail.stoyanov}@touchsurgery.com). G. Quellec is with Inserm, 29200, Brest - France (email: gwenole.quellec@inserm.fr). Corresponding author: Danail Stoyanov.
The ability to localize anatomical structures in real-time and to capture the interaction of surgical instruments with the anatomy are fundamental building blocks for any system aiming at computer assisted intervention and robotic automation [22]. This would enable an overall understanding of the state of a procedure and would allow the healthy state of different anatomical landmarks to be monitored at any moment. These insights could help to develop scoring systems for surgeons’ interaction with the patient’s anatomy. Moreover, semantic segmentation of instruments enables the creation of an accurate profile of instrument usage across an operation. Such analytics, along with data for instrument trajectories, could support both intra-operative risk estimation systems and post-operative analytics. There could also be the potential to score surgeons’ instrument handling skills and report feedback when tremor or abrupt movements are detected. Furthermore, semantic segmentation could be used in conjunction with style transfer methods for label transfer [23].
Recently, the CATARACTS challenge presented 50 annotated surgical videos obtained through a surgical microscope [24]. The dataset was annotated to provide both frame-level instrument presence labels and frame-level surgical phase labels [24], [25]. Even though cataract surgery is less prone to complications, risk mitigation can have a large impact, with over 20 million cases recorded in 2010 [26]. In addition, a study on medical malpractice claims related to cataract surgery revealed that 76.28% of the 118 claims are intra-operative allegations [27], and another study reported the rate of a certain intra-operative complication (posterior capsular rent) separately for experienced surgeons and for residents [28]. With this in consideration, a dataset for semantic segmentation may lead to the development of systems that could potentially reduce risk and improve workflow.
In this paper, we introduce a semantic segmentation dataset generated from videos of the training set of the CATARACTS dataset. Our dataset follows a similar paradigm to [11] as it includes pixel-wise annotations for the entire surgical scene for cataract surgery procedures, including anatomical structures and surgical instruments, for 4670 surgical microscope images. The aim of releasing such a dataset is to allow simultaneous anatomy and instrument pixel-level localization. A potential application could be the detection of anatomy and surgical instrument interactions which can be subsequently used to assess the safety and progress of the surgical procedure. We demonstrate how this dataset can be used to train state-of-the-art deep learning frameworks to segment microscope images from cataract surgery. We believe this contribution will underpin the development of CAI techniques based on surgical vision.
The dataset was generated from the training videos released for the CATARACTS challenge [24]. The CATARACTS challenge training set includes 25 videos with average duration of 10 minutes and 56 seconds recorded at 30 frames per second (fps).
A. Data sources
The recorded operations were performed in Brest University Hospital from January to September 2015 [24]. The videos were recorded during the phacoemulsification procedure using a 180I camera (Toshiba, Tokyo, Japan) mounted on an OPMI Lumera T microscope (Carl Zeiss Meditec, Jena, Germany) focusing on the patient’s eye. The surgeries were performed by three surgeons of varying expertise levels (one expert, one mid-level and an intern surgeon). The average age of the patients was 61 years old, with a minimum of 23, a maximum of 83 years old and 10 years standard deviation. The surgeries were performed because of age-related causes, trauma and refractive errors. Each video corresponds to a different patient. The study was approved by the Institutional Review Board of Brest University Hospital on 28 January 2013. All patients were informed and gave their consent to participate in the study.
B. Training and test set characteristics
Frames from the 25 training videos were extracted using the ground truth instrument and phase information. This was done to select video frames that include instruments and to ensure a class distribution across the surgical phases that represents real-world scenarios. In particular, because pixel-level manual labelling for semantic mask generation carries a large overhead, the videos were sampled so that the labelled frames cover as much scene variation as possible. The surgical procedures were divided into 14 phases as in [25]. The phases sampled per video in the presented dataset are given in Table I. Between 10 and 20 frames were randomly selected per phase such that the frames are at least 3 seconds apart. The images were also resized. In total, 4670 frames were selected.
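The sampling rule described above (10 to 20 random frames per phase, at least 3 seconds apart at 30 fps) can be sketched as follows. This is only an illustration under stated assumptions: the helper name `sample_phase_frames`, the representation of a phase as a range of frame indices and the greedy random selection are all hypothetical, since the text specifies only the constraints and not the procedure used.

```python
import random

FPS = 30         # frame rate of the CATARACTS training videos
MIN_GAP_S = 3    # minimum spacing between selected frames, in seconds


def sample_phase_frames(start, end, n_frames, rng, fps=FPS, min_gap_s=MIN_GAP_S):
    """Pick up to n_frames indices in [start, end) that are >= min_gap_s apart.

    Hypothetical helper: candidates are visited in random order and kept only
    if they respect the minimum gap to every frame already selected.
    """
    min_gap = fps * min_gap_s
    candidates = list(range(start, end))
    rng.shuffle(candidates)
    selected = []
    for idx in candidates:
        if all(abs(idx - s) >= min_gap for s in selected):
            selected.append(idx)
            if len(selected) == n_frames:
                break
    return sorted(selected)


rng = random.Random(0)
# A hypothetical phase spanning frames 1000..3999 (100 seconds of video).
frames = sample_phase_frames(1000, 4000, n_frames=rng.randint(10, 20), rng=rng)
```

A short phase may not have room for the requested number of frames, in which case the helper simply returns fewer.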
C. Annotation process
After frame selection, the frames were annotated manually. The guidelines for anatomy and instrument annotation were drafted by an in-house expert medical officer. A team of four in-house roto artists (annotators) created the pixel-wise segmentation masks using commercial rotoscoping software. The annotators were trained by the medical officer in order to become familiar with the phacoemulsification procedure and the different instruments used at each phase, and they had direct access to the medical officer at all steps of the annotation process. Every frame was annotated by one roto artist. To ensure the quality of annotations, every annotated frame was checked by a second annotator. In case of disagreement between the annotators, the medical officer’s opinion was sought in accordance with the specified annotation guidelines. The medical officer validated the segmentation mask annotations. Further pixel-wise checks per segmentation mask were performed by programmatically extracting all contours from the generated segmentation masks and overlaying them on the respective image frame. This facilitated visual inspection of the segmentation masks to ensure accurate anatomy and instrument boundaries. In addition, pixel-wise checks were performed to ensure that all clusters of pixels larger than 50 pixels are assigned to a class. The same annotation process was applied to all selected frames (training, validation and test sets).
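The check that every cluster larger than 50 pixels is assigned to a class can be illustrated with a connected-component pass over a mask. The sketch below is not the tooling used for the dataset; it assumes that unlabelled pixels carry a sentinel value (`UNLABELLED`, hypothetical) and that clusters are 4-connected.

```python
from collections import deque

UNLABELLED = -1    # assumed sentinel for pixels without a class
MAX_CLUSTER = 50   # clusters larger than this must be assigned to a class


def unlabelled_clusters(mask, sentinel=UNLABELLED):
    """Return the sizes of 4-connected clusters of unlabelled pixels."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    sizes = []
    for r in range(height):
        for c in range(width):
            if mask[r][c] != sentinel or seen[r][c]:
                continue
            # Breadth-first flood fill starting from this unlabelled pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and mask[ny][nx] == sentinel):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes


def passes_check(mask, max_cluster=MAX_CLUSTER):
    """True if no unlabelled cluster exceeds max_cluster pixels."""
    return all(size <= max_cluster for size in unlabelled_clusters(mask))


# Toy 20x20 mask: class 0 everywhere, with a 3x3 and an 8x8 unlabelled hole.
mask = [[0] * 20 for _ in range(20)]
for r in range(3):
    for c in range(3):
        mask[r][c] = UNLABELLED
for r in range(10, 18):
    for c in range(10, 18):
        mask[r][c] = UNLABELLED
```

In this toy mask the 8x8 hole (64 pixels) fails the check, while the 3x3 hole (9 pixels) alone would pass.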
D. Sources of error
Potential sources of error in the annotation can be attributed to blurriness due to substantial instrument or patient motion. This causes the instrument or anatomy to be out of focus and, therefore, to lack clear boundaries in some frames. However, even in these cases, it was ensured that the instrument and anatomy boundaries are as accurate as possible. Specular reflections may also lead to inaccurate boundary delineation, especially for the instrument tips when they are inside the anatomy.
Fig. 2: Instances for all instruments appearing in the dataset. Panels include: (a) Hydrodissection cannula, (b) Viscoelastic cannula, (c) Capsulorhexis cystotome.
E. Dataset statistics
The dataset includes 36 classes: 29 surgical instrument classes, 4 anatomy classes and 3 miscellaneous classes. The list of classes per category and the statistics of the dataset are given in Table II. As expected, the anatomy classes appear more frequently than the surgical instruments. The anatomy also covers the largest part of the scene, as can be seen from the average number of pixels that represent the pupil, iris and cornea compared to the surgical instruments (Table II). In addition, the Presence In Videos metric shows that 17 instrument classes appear in fewer than half of the videos. The instance and pixel distributions indicate that the dataset is highly imbalanced and, consequently, accurate instrument classification is more challenging. Furthermore, there are other visual challenges due to the high inter-class similarity among instruments. For example, Figure 2 shows four different types of cannulas, which look very similar. Each of these cannulas is used to perform different actions, such as injecting material and handling tissue. Therefore, as the type of instrument can be one of the main indications of which surgical action has been performed, it is of interest to distinguish different instrument types.
Table I: Phases sampled per video in the CaDIS dataset. Phase numbering in the table is as defined in [25]. The defined phases are: 1) Access of anterior chamber: sideport incision, 2) Access of anterior chamber: mainport incision, 3) Lens removal: Viscoelastic injection, 4) Lens removal, 5) Phacoemulsification: Viscoelastic injection, 6) Phacoemulsification: Capsulorhexis, 7) Phacoemulsification: Lens hydrodissection, 8) Phacoemulsification, 9) Phacoemulsification: Lens matter removal, 10) Lens insertion: Viscoelastic injection, 11) Lens insertion, 12) Aspiration of viscoelastic, 13) Wound closure and 14) Wound closure with suture
A set of experiments was performed using the presented dataset for semantic segmentation in cataract surgery videos on three different tasks, as described in the following sections. State-of-the-art segmentation networks were used to provide a baseline for future experiments using the dataset.
A. Tasks
Three different tasks are presented, each using a different class grouping. The motivation for these tasks is that anatomical structure and instrument localization could be useful for intra- and post-operative image guidance and risk assessment, while instrument segmentation and identification can be useful to different degrees. A brief description of each task is given in the following sections.
Table II: Total instances per class, total presence of class in videos and average number of pixels per class per frame for all videos and per split
Table III: Number of parameters of baseline models
1) Task I: The first task focuses on differentiating between anatomy and instruments within every frame. The focus is therefore less on identifying types of instruments; the purpose of this scenario is mainly to identify anatomical structures. To achieve this, the first task includes 8 classes: 4 classes for anatomical structures, 1 class for all instruments and 3 classes for all other objects appearing in the scene (Table V). This task describes scenarios in which the aim is to distinguish between different anatomical landmarks and surgical instruments without identifying the type of instrument.
Table IV: Mean Intersection over Union (mIoU), Pixel Accuracy (PA) and Pixel Accuracy per Class (PAC) per model for validation and test sets for Task I
Table V: mIoU per class for test set for Task I
Table VI: Mean Intersection over Union (mIoU), Pixel Accuracy (PA) and Pixel Accuracy per Class (PAC) per model for validation and test sets for Task II
2) Task II: The second task includes 17 classes, given in Table VII. This task incorporates instrument classification, with instruments grouped into categories according to appearance similarities and instrument types. The aim is to identify anatomical structures and also the main types of instruments that appear in the scene. Identifying the main instrument type simultaneously gives more information on the stage of the procedure through scene segmentation. For distinct instrument types, the type of instrument can also help to differentiate overlapping instruments in the segmentation output, which would otherwise be shown as one merged area. In addition, grouping instrument classes mitigates class imbalance while still allowing a degree of instrument classification in combination with anatomy segmentation. In particular, the classes that are merged are: i) hydrodissection cannula and handle, viscoelastic cannula, Rycroft cannula and handle and Charleux cannula as cannula and ii) Bonn and Troutman forceps as tissue forceps, while all the other instrument classes were merged with their respective handles. For example, from the statistics of Table II, it can be seen that the Troutman forceps do not appear in the validation set, appear in 14 frames of the test set and have a similar appearance to the Bonn forceps. Therefore, they are merged with the Bonn forceps. Similarly, the Charleux cannula is merged with the other cannula instruments. In addition, the instruments that only appear in the training set, cover relatively few pixels in the frame and cannot be merged with another instrument class were ignored during training. The ignored classes are: suture needle, needle holder, vitrectomy handpiece, marker, cotton, iris hooks and Mendez ring.
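The Task II grouping described above can be expressed as a small label-remapping function. The sketch below is an illustration only: it assumes that classes are identified by plain-text names and that handle classes are named by appending "handle" to the instrument name; the function name `task2_label` and the use of `None` to mark ignored classes are hypothetical choices, not part of the released dataset.

```python
# Groups taken from the Task II description; names are assumed to be stored
# as plain strings, which may differ from the released label files.
CANNULA_GROUP = {
    "hydrodissection cannula",
    "viscoelastic cannula",
    "rycroft cannula",
    "charleux cannula",
}
TISSUE_FORCEPS_GROUP = {"bonn forceps", "troutman forceps"}
IGNORED = {
    "suture needle", "needle holder", "vitrectomy handpiece",
    "marker", "cotton", "iris hooks", "mendez ring",
}
HANDLE_SUFFIX = " handle"   # assumed naming convention for handle classes


def task2_label(class_name):
    """Map an original class name to its Task II class (None if ignored)."""
    base = class_name.strip().lower()
    # Every handle is merged with its instrument.
    if base.endswith(HANDLE_SUFFIX):
        base = base[: -len(HANDLE_SUFFIX)]
    if base in IGNORED:
        return None
    if base in CANNULA_GROUP:
        return "cannula"
    if base in TISSUE_FORCEPS_GROUP:
        return "tissue forceps"
    return base


examples = ["Rycroft cannula handle", "Troutman forceps", "Mendez ring", "Pupil"]
remapped = [task2_label(name) for name in examples]
```

Anatomy and miscellaneous classes pass through unchanged, so the same function can be applied to every label in a mask.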
3) Task III: The third task includes 25 classes, listed in Table IX. This task allows more granular instrument classification by keeping each instrument and its respective handle as separate classes. The classes that do not appear in all splits and are present in fewer than 5 videos were ignored during training (Table II). Identifying all instrument types in the scene gives even more explicit information about the stage of the surgery. For example, different cannulas are used in different phases of the procedure. Therefore, the third task aims at combining anatomy and instrument segmentation while giving the most information about the procedure itself through identifying the exact type of the instrument.
B. Dataset splits
The videos are separated into training, validation and test sets. The dataset distribution per set is presented in Table II. As not all classes are present in all videos, we ensured that the videos in the training set include samples from all classes. We split the rest of the videos between the validation and test sets so that sufficient instrument instances were present in each set to allow a fair assessment of models across different instrument classes. The split could also have been made on a frame basis to obtain a more uniform class distribution between training, validation and testing. However, we chose to avoid dividing frames from a single video among the training, validation and test sets. The training, validation and test sets contain 3550, 534 (Videos 5, 7 and 16) and 586 (Videos 2, 12 and 22) images respectively. The dataset is imbalanced, since the classes that represent instruments appear less frequently and occupy fewer pixels per frame than the anatomy (Table II). Three tasks are presented, in which the instrument classes are merged differently in order to assess the effect of class imbalance and to illustrate different segmentation scenarios.
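As a minimal sketch of the video-level split, the assignment below keeps all frames of a video in the same set. The validation and test video numbers and the image counts are taken from the text; the assumption that the 25 training videos are numbered 1 to 25, and the names `split_of` and `IMAGES_PER_SPLIT`, are illustrative only.

```python
VAL_VIDEOS = {5, 7, 16}
TEST_VIDEOS = {2, 12, 22}
ALL_VIDEOS = range(1, 26)   # assumption: videos are numbered 1..25


def split_of(video_id):
    """Return the split that every frame of the given video belongs to."""
    if video_id in VAL_VIDEOS:
        return "val"
    if video_id in TEST_VIDEOS:
        return "test"
    return "train"


splits = {"train": [], "val": [], "test": []}
for video_id in ALL_VIDEOS:
    splits[split_of(video_id)].append(video_id)

# Image counts per split as reported in the text.
IMAGES_PER_SPLIT = {"train": 3550, "val": 534, "test": 586}
total_images = sum(IMAGES_PER_SPLIT.values())
```

Splitting by video rather than by frame avoids near-duplicate frames of the same patient appearing in both training and evaluation sets.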
C. Baseline models
The three tasks are benchmarked on state-of-the-art models to provide a baseline for semantic segmentation models for cataract surgery. The models used in the baseline experiments are UNet [29], DeepLabV3+ [30], UPerNet [31] and HRNetV2 [32]. UNet was proposed by Ronneberger et al. for biomedical image segmentation. It has been widely used in the medical community because of its relatively low number of parameters. DeepLabV3+ was introduced as an extension of DeepLab (v2 [33] and v3 [30]) that uses a modified Xception [34] as the encoder and combines it with atrous convolutions with different dilation rates to achieve better contextual predictions without losing image resolution. The atrous convolution enables DeepLabV3+ to benefit from long-range contextual information while preserving fine boundary information. In this work, MobileNetV2 [35] is used as the backbone for DeepLabV3+ in order to use a light-weight version of the model. UPerNet uses a pyramid pooling module to make use of both global and local contextual information. To extract and incorporate this information, the model relies on a Feature Pyramid Network to extract features at different scales of the encoder, which allows it to build a richer representation by combining information at multiple image scales. Lastly, HRNetV2 attempts to preserve high-resolution feature representations by combining features from all scales throughout the encoder and also from parallel convolution streams. The open-source implementations of the networks (UNet, DeepLabV3+, UPerNet and HRNetV2) were used in all experiments.
Table VII: mIoU per class for test set for Task II
D. Training process
1) Data pre/post-processing: Data augmentation was applied prior to model training. The same augmentation was applied for all models. Each training image was normalized, flipped and randomly rotated, and its hue and saturation were also adjusted. The input images were downsized to 270 × 480. No post-processing was performed.
2) Experiment parameters and setup: The network weights for UPerNet and HRNetV2 were initialized using weights pre-trained on ImageNet [36], while for DeepLabV3+ weights pre-trained on Pascal VOC [37] were used. The networks were trained on a system with two NVIDIA GTX 1080 Ti GPUs for 100 epochs. For all models, the Cross Entropy loss function was used with the same learning rate and the Adam optimizer. The $\beta_1$, $\beta_2$ and $\epsilon$ values for the Adam optimizer were set to 0.9, 0.999 and $10^{-8}$, which are proposed as good default values for the optimizer in [38].
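For reference, a single Adam update with the default values $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$ can be sketched in a few lines. This scalar version is only illustrative and is not the training code used for the baselines; the function name `adam_step`, the toy objective and the learning rate of 0.01 are assumptions chosen for the example, as the learning rate used in the experiments is not given here.

```python
import math


def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy objective f(x) = x^2 with gradient 2x, starting from x = 1.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.01)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by approximately the learning rate regardless of the gradient magnitude.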
E. Metrics
The metrics that are used to assess the segmentation quality are the mean Intersection over Union (mIoU), Pixel Accuracy (PA), Pixel Accuracy per Class (PAC) and the IoU per class. PA, PAC and mIoU are defined as follows:

$$\mathrm{PA} = \frac{\sum_{i=1}^{N} p_{ii}}{\sum_{i=1}^{N}\sum_{j=1}^{N} p_{ij}}$$

$$\mathrm{PAC} = \frac{1}{N}\sum_{i=1}^{N}\frac{p_{ii}}{\sum_{j=1}^{N} p_{ji}}$$

$$\mathrm{mIoU} = \frac{1}{N}\sum_{i=1}^{N}\frac{p_{ii}}{\sum_{j=1}^{N} p_{ij} + \sum_{j=1}^{N} p_{ji} - p_{ii}}$$

where $N$ is the number of classes and $p_{ij}$ is the number of pixels predicted as class $i$ and labelled as class $j$. It is worth noting that the ignored classes were not taken into account when the metrics were calculated.
Table VIII: Mean Intersection over Union (mIoU), Pixel Accuracy (PA) and Pixel Accuracy per Class (PAC) per model for validation and test sets for Task III
Table IX: mIoU per class for test set for Task III
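The PA, PAC and mIoU metrics of this section can be computed from a confusion matrix whose entry (i, j) counts the pixels predicted as class i and labelled as class j. The sketch below is a plain-Python illustration rather than the evaluation code used for the reported results. It assumes that pixels of ignored classes carry a dedicated label value (`IGNORE`, hypothetical) and that classes absent from both prediction and ground truth are left out of the averages.

```python
IGNORE = 255   # assumed label value for pixels of ignored classes


def confusion_matrix(pred, label, num_classes, ignore=IGNORE):
    """p[i][j] = number of pixels predicted as class i and labelled as class j."""
    p = [[0] * num_classes for _ in range(num_classes)]
    for pred_cls, label_cls in zip(pred, label):
        if label_cls == ignore:
            continue   # ignored classes do not contribute to any metric
        p[pred_cls][label_cls] += 1
    return p


def segmentation_metrics(p):
    """Return (PA, PAC, mIoU) computed from the confusion matrix p."""
    n = len(p)
    total = sum(sum(row) for row in p)
    pa = sum(p[i][i] for i in range(n)) / total
    accuracies, ious = [], []
    for i in range(n):
        predicted_i = sum(p[i])                        # pixels predicted as i
        labelled_i = sum(p[j][i] for j in range(n))    # pixels labelled as i
        union = predicted_i + labelled_i - p[i][i]
        if labelled_i:
            accuracies.append(p[i][i] / labelled_i)
        if union:
            ious.append(p[i][i] / union)
    return pa, sum(accuracies) / len(accuracies), sum(ious) / len(ious)


# Flattened toy prediction and ground truth with three classes.
pred = [0, 0, 1, 1, 2, 2, 0, 1]
label = [0, 0, 1, 2, 2, 2, IGNORE, 1]
pa, pac, miou = segmentation_metrics(confusion_matrix(pred, label, 3))
```

For this toy example, seven pixels are evaluated and six are correct, giving PA = 6/7, PAC = 8/9 and mIoU = 7/9.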
F. Results
1) Task I: The overall mIoU, PA and PAC for the validation and test sets for all models in Task I are given in Table IV. In particular, for anatomy segmentation on the test set, UNet achieves an mIoU of 85.8 %, DeepLabV3+ 84.5 %, UPerNet 84.9 % and HRNetV2 85 % (Table V). Similarly, for instrument segmentation, UNet gives an mIoU of 73.8 %, DeepLabV3+ 74.3 %, UPerNet 76.4 % and HRNetV2 77 % (Table V).
2) Task II: The mIoU, PA and PAC for the validation and test sets are shown in Table VI. The mIoUs for anatomy segmentation are 85.4 %, 83.9 %, 84.9 % and 86.3 % for UNet, DeepLabV3+, UPerNet and HRNetV2 respectively (Table VII). For instrument segmentation in Task II, the mean IoUs over the instrument classes are 60.9 %, 64.6 %, 65.5 % and 68.6 % for UNet, DeepLabV3+, UPerNet and HRNetV2 respectively (Table VII).
Fig. 3: Example frames with ground truth segmentation and model predictions for Task I.
Fig. 4: Example frames with ground truth segmentation and model predictions for Task II.
3) Task III: The results for Task III for the validation and test sets are given in Table VIII. With all instrument tips and handles now included as separate classes, the mIoUs for instrument segmentation for UNet, DeepLabV3+, UPerNet and HRNetV2 are 52 %, 58.2 %, 63 % and 62.5 % respectively (Table IX). For anatomy segmentation, the IoUs per class are similar to the previous tasks (Table IX).
G. Discussion
1) Task I: The mIoU for all models in Task I is comparable for the classification into 8 classes, with HRNetV2 presenting the highest mIoU and DeepLabV3+ the lowest for both validation and test sets (Table IV). The differences in mIoU between the models are small because the imbalance among the classes is reduced by representing all instruments with one class.
Fig. 5: Example frames with ground truth segmentation and model predictions for Task III.
2) Task II: The differences in the capacity of each network for simultaneous anatomy segmentation and instrument classification are further highlighted as the number of classes increases (Table VI). All networks achieve a high mIoU for large classes, such as the anatomical classes and the instrument classes that are represented by a large number of pixels (Table VII). For the instrument classes appearing in the test set, UNet has the lowest mIoU with 60.9 % and HRNetV2 the highest with 68.56 %. It is worth noting that the cannula group of classes and the capsulorhexis cystotome have a relatively low mIoU (Table VII). This is because these classes are represented by a small number of pixels and a large percentage of those pixels is classified as anatomy around the boundary, despite a visually accurate segmentation (Figure 4). This is also verified by the confusion matrix of HRNetV2 (Figure 6). Similar results occur for the capsulorhexis cystotome and the micromanipulator, both of which are fine instruments represented by a small number of pixels relative to the anatomy classes (Figure 4). Lastly, the capsulorhexis forceps present a low mIoU for all models because they are frequently misclassified as tissue forceps (Figure 4), as shown by the confusion matrix for HRNetV2 given in Figure 6. These two classes could have been merged into one group; however, as the capsulorhexis forceps appear in all splits with a sufficient number of training images, they are represented by a separate class.
3) Task III: The mIoU per class given in Table IX shows that the instrument handles are classified with varying degrees of accuracy (Figure 5). This is because class imbalance becomes more evident as the number of classes increases.
4) All Tasks: Throughout all tasks, for all models, we observed that PA and PAC are higher than the mIoU. The reason for PA being higher is class imbalance, as pixels from the anatomical classes are significantly more numerous than the pixels that represent instruments (Tables IV, VI and VIII). The difference between PA and mIoU becomes smaller after classes are merged; however, imbalance still exists. PA is therefore dominated by the classes with a higher number of pixels and cannot reflect changes in performance on instruments, for which the number of instances is lower. The most descriptive metric is the mIoU, as it calculates the overlap between the pixels for each class of the ground-truth and the predicted segmentation masks. The mIoU per class is an even more insightful metric, as it assesses the performance of the model in segmenting specific classes and can serve as a direct comparison among the classes of interest for each model.
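The effect of class imbalance on PA, PAC and mIoU can be seen in a small worked example. The numbers below are hypothetical and chosen only for illustration: a 1000-pixel frame with 950 anatomy pixels and 50 instrument pixels, in which the model labels all anatomy correctly but recovers only half of the instrument pixels.

```python
# Hypothetical 1000-pixel frame: 950 anatomy pixels and 50 instrument pixels.
correct_anatomy = 950
correct_instrument = 25
missed_instrument = 25   # instrument pixels predicted as anatomy
total = correct_anatomy + correct_instrument + missed_instrument

pixel_accuracy = (correct_anatomy + correct_instrument) / total

# Per-class accuracy: correct pixels over pixels labelled as that class.
acc_anatomy = correct_anatomy / 950
acc_instrument = correct_instrument / 50
pixel_accuracy_per_class = (acc_anatomy + acc_instrument) / 2

# IoU: the anatomy union also contains the 25 misassigned instrument pixels.
iou_anatomy = correct_anatomy / (correct_anatomy + missed_instrument)
iou_instrument = correct_instrument / (correct_instrument + missed_instrument)
mean_iou = (iou_anatomy + iou_instrument) / 2
```

Here PA is 97.5 % and PAC is 75 %, while the mIoU is about 73.7 %, because the instrument IoU of 50 % carries the same weight as the anatomy IoU.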
There is a consistent difference between the mIoU for the validation and test sets, as can be seen in Tables IV, VI and VIII. This can be explained by the distribution of class instances in each set: despite the attempt to have a similar distribution of instances in each set of videos, the distribution of instrument classes varies between sets. This further justifies the choice of ignoring classes that do not have sufficient instances in the test set and are not present in the validation set.
Overall, DeepLabV3+, UPerNet and HRNetV2 achieve a higher mIoU for instrument segmentation and classification than UNet (Tables V, VII and IX). In particular, UNet achieves an mIoU over 85% for anatomy segmentation in all tasks but gives a lower mIoU for instrument segmentation and classification. This difference in performance is smaller when the type of instrument does not need to be identified (Table V) but is more evident when instrument classification is performed (Tables VII and IX). UPerNet and HRNetV2 have the highest mIoU for simultaneous anatomy segmentation and instrument classification. Figures 3, 4 and 5 also show that the boundaries of the areas segmented by these two models are smoother and less noisy than those of DeepLabV3+ and UNet. It is also worth noting that DeepLabV3+ was trained using a MobileNetV2 backbone in order to assess the performance of a light-weight version of the network. It performs more accurate instrument segmentation than UNet, as the mIoU for instrument classes for all tasks highlights (Tables V, VII and IX).
Fig. 6: Confusion matrix for HRNetV2 on test set of Task II
Semantic segmentation of a surgical scene can improve understanding of the workflow of a surgical procedure and is crucial for intra-operative image guidance. In this paper, we present a dataset for semantic segmentation of images from cataract surgery procedures. The dataset consists of 4670 labelled images, which are sampled from the training set of the CATARACTS challenge dataset. The dataset labels include 36 classes: four classes describing anatomical structures, 29 surgical instrument classes and three classes for other objects appearing in the surgical scene. The statistics presented for the dataset illustrate that the dataset is imbalanced, as the surgical instrument classes appear less frequently and are represented by a smaller number of pixels compared to the anatomy classes. Three sets of tasks were performed using the UNet, DeepLabV3+, UPerNet and HRNetV2 deep learning models. Each task presents different groups of instrument classes in order to assess the effect of simultaneous instrument classification on the segmentation output. It was shown that the four networks perform similarly for a relatively small number of classes with a comparable number of pixels, which addresses the imbalance issue. As the number of classes increases, HRNetV2 and UPerNet perform better in simultaneous anatomy segmentation and instrument classification than DeepLabV3+ and UNet, as HRNetV2 and UPerNet have a larger receptive field and are more capable of segmenting finer features. The mIoU per class metric reveals that UNet performs well in segmenting large areas such as the anatomical structures, while DeepLabV3+, UPerNet and HRNetV2 provide more consistent instrument segmentation and classification in all performed tasks. The aim of introducing a dataset for semantic segmentation in cataract surgery is to facilitate further development of computer-assisted strategies for image guidance.
We would like to thank Ellie Jaram, Nunzia Lombardo, Fanni Demeter and Hannah Bradd for their efforts in annotating the dataset to the highest quality.
[1] L. Maier-Hein et al., “Surgical data science: Enabling next-generation surgery,” Nature Biomedical Engineering, vol. 1, pp. 691–696, 2017.
[2] C. Zeng et al., “A combination of three-dimensional printing and computer-assisted virtual surgical procedure for preoperative planning of acetabular fracture reduction,” Injury, vol. 47, no. 10, pp. 2223–2227, 2016.
[3] M. Tonutti et al., “A machine learning approach for real-time modelling of tissue deformation in image-guided neurosurgery,” Artificial intelligence in medicine, vol. 80, pp. 39–47, 2017.
[4] M. Kaus et al., “Automated segmentation of mr images of brain tumors,” Radiology, vol. 218, no. 2, pp. 586–591, 2001.
[5] Y. Kassahun et al., “Surgical robotics beyond enhanced dexterity instrumentation: a survey of machine learning techniques and their role in intelligent and autonomous surgical actions,” International journal of computer assisted radiology and surgery, vol. 11, no. 4, pp. 553–568, 2016.
[6] S. S. Vedula et al., “Objective assessment of surgical technical skill and competency in the operating room,” Annual review of biomedical engineering, vol. 19, pp. 301–325, 2017.
[7] A. P. Twinanda et al., “Endonet: a deep architecture for recognition tasks on laparoscopic videos,” IEEE transactions on medical imaging, vol. 36, no. 1, pp. 86–97, 2016.
[8] R. Sznitman et al., “Data-driven visual tracking in retinal microsurgery,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2012, pp. 568–575.
[9] M. Allan et al., “2017 Robotic Instrument Segmentation Challenge,” arXiv preprint arXiv:1902.06426, 2019.
[10] T. Ross et al., “Robust Medical Instrument Segmentation Challenge 2019,” arXiv preprint arXiv:2003.10299, 2020.
[11] M. Allan et al., “2018 Robotic Scene Segmentation Challenge,” arXiv preprint arXiv:2001.11190, 2020.
[12] L. Maier-Hein et al., “Bias: Transparent reporting of biomedical image analysis challenges,” Medical image analysis, vol. 66, p. 101796, 2020.
[13] I. Kovler et al., “Haptic computer-assisted patient-specific preoperative planning for orthopedic fractures surgery,” International journal of computer assisted radiology and surgery, vol. 10, no. 10, pp. 1535–1546, 2015.
[14] M. Pfeiffer et al., “Learning soft tissue behavior of organs for surgical navigation with convolutional neural networks,” International journal of computer assisted radiology and surgery, vol. 14, no. 7, pp. 1147–1155, 2019.
[15] F. Ozdemir and O. Goksel, “Extending pretrained segmentation networks with additional anatomical structures,” International journal of computer assisted radiology and surgery, pp. 1–9, 2019.
[16] L. C. García-Peraza-Herrera et al., “Real-time segmentation of non-rigid surgical tools based on deep learning and tracking,” in International Workshop on Computer-Assisted and Robotic Endoscopy. Springer, 2016, pp. 84–95.
[17] F. Fuentes-Hurtado et al., “Easylabels: weak labels for scene segmentation in laparoscopic videos,” International journal of computer assisted radiology and surgery, pp. 1–11, 2019.
[18] M. Allan et al., “Image based surgical instrument pose estimation with multi-class labelling and optical flow,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 331–338.
[19] P. Suetens et al., “Image segmentation: methods and applications in diagnostic radiology and nuclear medicine,” European journal of radiology, vol. 17, no. 1, pp. 14–21, 1993.
[20] D. Bouget et al., “Semantic segmentation and detection of mediastinal lymph nodes and anatomical structures in CT data for lung cancer staging,” International journal of computer assisted radiology and surgery, pp. 1–10, 2019.
[21] S. Engelhardt et al., “Improving surgical training phantoms by hyperrealism: deep unpaired image-to-image translation from real surgeries,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 747–755.
[22] A. Kadkhodamohammadi et al., “Feature aggregation decoder for segmenting laparoscopic scenes,” in OR 2.0 Context-Aware Operating Theaters and Machine Learning in Clinical Neuroimaging. Springer, 2019, pp. 3–11.
[23] I. Luengo et al., “Surreal: Enhancing surgical simulation realism using style transfer,” arXiv preprint arXiv:1811.02946, 2018.
[24] H. Al Hajj et al., “Cataracts: Challenge on automatic tool annotation for cataract surgery,” Medical image analysis, vol. 52, pp. 24–41, 2019.
[25] O. Zisimopoulos et al., “Deepphase: surgical phase recognition in cataracts videos,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 265–272.
[26] WHO, “Priority eye diseases,” Apr 2018. [Online]. Available: https://www.who.int/blindness/causes/priority/en/index1.html
[27] J. E. Kim et al., “Medical malpractice claims related to cataract surgery complicated by retained lens fragments (an American Ophthalmological Society thesis),” Transactions of the American Ophthalmological Society, vol. 110, p. 94, 2012.
[28] A. Chakrabarti and N. Nazm, “Posterior capsular rent: Prevention and management,” Indian journal of ophthalmology, vol. 65, no. 12, p. 1359, 2017.
[29] O. Ronneberger et al., “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[30] L.-C. Chen et al., “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
[31] T. Xiao et al., “Unified perceptual parsing for scene understanding,” in European Conference on Computer Vision. Springer, 2018.
[32] K. Sun et al., “High-resolution representations for labeling pixels and regions,” arXiv preprint arXiv:1904.04514, 2019.
[33] L.-C. Chen et al., “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[34] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.
[35] M. Sandler et al., “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.
[36] J. Deng et al., “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255.
[37] M. Everingham et al., “The Pascal visual object classes (VOC) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
[38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.