The Challenge Image
You just need an email address to receive any images you take with the MicroObservatory robotic telescopes.If you are 12 or younger, you will need to have your parent/guardian do this activity with you.
The Challenge image
Approach: A total of 39 teams developed, validated, and tested their TC estimation algorithms during the challenge. The training, validation, and testing sets consisted of 2394, 185, and 1119 image patches originating from 63, 6, and 27 scanned pathology slides from 33, 4, and 18 patients, respectively. The summary performance metric used for comparing and ranking algorithms was the average prediction probability concordance (PK) using scores from two pathologists as the TC reference standard.
In current practice, TC is manually estimated by pathologists on hematoxylin and eosin (H&E)-stained slides, a task that is time consuming and prone to human variability. Figure 1 shows examples of various levels of TC within different regions of interest (ROIs) on an H&E stained slide. The majority of practicing pathologists have not been trained to estimate TC as this measurement was only proposed by Symmans et al.6 in 2007, and it is currently not part of practice guidelines for reporting on breast cancer resection specimens. That being said, the use of TC scoring is expected to grow because the quantitative measurement of residual cancer burden has proven effective in NAT trials. There is great potential to leverage automated image analysis algorithms for this task to
Global image analysis challenges, such as Cancer Metastases in Lymph Nodes (CAMELYON)14 and Breast Cancer Histology (BACH),15 have been instrumental in enabling direct comparisons of a range of techniques in computerized pathology slide analysis. Public challenges, in general, in which curated datasets are released to the public in an organized manner, are useful tools for understanding the state of AI/ML for a task because they allow algorithms to be compared using the same data, reference standard, and scoring methods. These challenges can also be useful for improving our understanding of how different choices for reference standards or a different performance metric impact AI/ML algorithm performance and interalgorithm rankings.
For each patch, a TC rating, ranging from 0% to 100%, was provided by the pathologist, based on the recommended protocol outlined by Symmans et al.6 Patches that did not contain any tumor cells were assigned a TC rating of 0%. The training and validation sets were only annotated by path1, whereas the test set was annotated by both path1 and a breast pathologist (path2). Both path1 and path2 had over 10 years of experience.16 Annotations were performed independently, and therefore, each pathologist was unaware of the rating assigned by the other. The distribution of pathologist manual TC ratings used as the reference standard in this challenge for the training, validation, and test sets is given in Fig. 2. The number of patches for which reference standard scores were provided was 2394, 185, and 1119 for training, validation, and test sets, respectively.
The BreastPathQ utilized an instance of the MedICI Challenge platform to conduct this challenge.20 The MedICI Challenge platform supports user and data management, communications, performance evaluation, and leaderboards, among other functions. The platform was used in this challenge as a front-end for challenge information and rules, algorithm performance evaluation, leaderboards, and ongoing communication among participants and organizers through a discussion forum.
The challenge was set up to allow participants to submit patch-based TC scores during the training and validation phases of the challenge and receive prediction probability (PK) performance feedback scores via an automated Python script. The script initially verifies that the submitted score file is valid by checking if the submitted file is formatted correctly and that all patches have a score. An invalid submitted score file was not considered part of the submission limit for the participants. The same evaluation script was used for the training, validation, and test phases. This enabled participants to validate the performance of their algorithms during development as well as familiarize themselves with the submission process prior to the test phase of the challenge.
Ensemble methods, which combine the output of multiple trained neural networks into a single output, have become a common approach for challenge participants for improving AI/ML algorithm performance. It was the same for the BreastPathQ Challenge, in which most of the teams used an ensemble of deep learning algorithms instead of limiting themselves to just a single deep learning architecture and training. In general, the ensemble method had higher PK performance than the nonensemble methods, and the top five algorithms in terms of PK all used an ensemble of deep learning architectures. The advantage of ensembles or combinations of algorithms leading to improved performance was also observed in the DM DREAM Challenge, in which the ensemble method significantly improved the AUC over the best single method from 0.858 to 0.89539 for the binary task of cancer/no cancer presence in screening mammography. Our results indicate that ensembles of deep-learning architectures can improve estimation performance in independent testing compared with single classifier implementations at the cost of additional training time and validating the multiple neural networks.
where C is the number of concordant pairs, D is the number of discordant pairs, TA is the number of ties in the submitted algorithm results, and TR is the number of ties in the reference standard. However, one of the participants in the challenge (David Chambers, Southwest Research Institute, Team: dchambers) identified a problem with τB early after the initial release of the training data. The participant found, and we confirmed through simulations, that by simply binning continuous AI/ML algorithm outputs (e.g., binning scores to 10 equally spaced bins between 0 and 1 instead of using a continuous estimate between 0 and 1) one could artificially increase the number of ties TA that an algorithm produces. Binning also impacted the number of concordant C and discordant D pairs. Based on our simulation studies, we found that binning decreased the number of concordant pairs C somewhat but also lead to a much larger decrease in the number of discordant pairs D because regions having similar TC scores are more difficult to differentiate than regions having large differences in TC in general. Binning had a relatively small impact on the τB denominator such that the overall effect was to increase τB compared with using continuous TC estimates or even smaller bin sizes. To prevent the possibility of the challenge results being manipulated through the binning of algorithm outputs, we revised our initial concordance endpoint to use the PK metric, which does not suffer from this shortcoming. Increasing algorithm ties TA by binning still impacts C and D, but the large reduction in D reduced the PK denominator C+D+TA to a larger degree than the numerator C+12TA such that binning algorithm estimates tend to reduce PK instead of improving it.
We would like to thank Diane Cline, Lillian Dickinson, and SPIE; Dr. Samuel Armato and the AAPM; and the NCI for their help in organizing and promoting the challenge. The data were collected at the Sunnybrook Health Sciences Centre, Toronto, Ontario, as part of a research projected funded by the Canadian Breast Cancer Foundation (Grant No. 319289) and the Canadian Cancer Society (Grant No. 703006). The mention of commercial products, their sources, or their use in connection with material reported herein is not to be construed as either an actual or implied endorsement of such products by the Department of Health and Human Services.
Then comes the HOLiS microscope, which operates at lightning speed to generate massive, technicolor 3D images of each section. The technique works by projecting laser light into the tissue to create a sheet of light that illuminates a very thin tilted plane, while a fast camera captures an image of the same plane. By moving the brain section at constant speed, successive images of each plane can be stacked together to form a long 3D block. The tissue is then scanned back and forth to cover its whole volume before moving onto the next section.
Adding to the mix of talent on the project is Pavel Osten, MD, PhD, a pioneer in the field of whole-brain cellular imaging and now president, founder and Chief Scientific Officer of a new start-up company. Dr. Osten was instrumental in planning the project and will provide guidance and advice on the best ways to rapidly analyze HOLiS images to find all of the cells and to map information from HOLiS scans onto established anatomical atlases of the human brain.
To build these tools, AI researchers need access to substantial volumes of imaging data annotated by expert radiologists. Data challenges engage the radiology community to develop such datasets, which provide the standard of truth in training AI systems to perform tasks relevant to diagnostic imaging.
In a challenge, researchers compete on how well their AI models perform specific tasks such as detection, localization and categorization of abnormal features according to defined performance measures. Each AI challenge explores and demonstrates the ways AI can benefit radiology and improve patient care.
Automatic detection of pulmonary nodules in thoracic computed tomography (CT) scans has been an active area of research for the last two decades. However, there have only been few studies that provide a comparative performance evaluation of different systems on a common database. We have therefore set up the LUNA16 challenge, an objective evaluation framework for automatic nodule detection algorithms using the largest publicly available reference database of chest CT scans, the LIDC-IDRI data set. In LUNA16, participants develop their algorithm and upload their predictions on 888 CT scans in one of the two tracks: 1) the complete nodule detection track where a complete CAD system should be developed, or 2) the false positive reduction track where a provided set of nodule candidates should be classified. This paper describes the setup of LUNA16 and presents the results of the challenge so far. Moreover, the impact of combining individual systems on the detection performance was also investigated. It was observed that the leading solutions employed convolutional networks and used the provided set of nodule candidates. The combination of these solutions achieved an excellent sensitivity of over 95% at fewer than 1.0 false positives per scan. This highlights the potential of combining algorithms to improve the detection performance. Our observer study with four expert readers has shown that the best system detects nodules that were missed by expert readers who originally annotated the LIDC-IDRI data. We released this set of additional nodules for further development of CAD systems. 041b061a72