David C. Christiani and Xinan Wang both work on the Boston Lung Cancer Study. By Courtesy of Xinan Wang

‘A Million Data Points’: A 30-Year-Long Lung Cancer Study Meets AI

The Boston Lung Cancer Study is a long-running study of lung cancer patients that analyzes the disease’s genetic and environmental risk factors. In recent years, the study has reached a new frontier in medicine: the Harvard Artificial Intelligence in Medicine program has begun analyzing the dataset in unprecedented ways, using artificial intelligence.

By Richard Y. Zhu

One study. More than 12,000 patients. Countless variables in age, gender, lifestyle, smoking history, genomics, tumor samples, and more. A dynamic team of public health researchers, physicians, and computer scientists.

The Boston Lung Cancer Study (BLCS) began in 1992, when David C. Christiani, a professor at Harvard School of Public Health and a physician at Massachusetts General Hospital, started recruiting a cohort of patients to study the interplay of genetics and the environment in lung cancer.

“There were some very strong connections between environment and disease that begged clarification between Nature and Nurture,” Christiani said in an email.

Since 1999, he has worked alongside Bruce E. Johnson ’75, an oncologist at Dana-Farber Cancer Institute, to expand the BLCS patient cohort. “I believe this is the comprehensive lung cancer cohort,” Johnson wrote in an email.

Christiani and Johnson have built a dataset of thousands of patients, which encapsulates millions of variables, including genomes, lung scans, tumor samples, patient outcomes, and smoking histories. Given this volume of data, scientists have turned to artificial intelligence to more quickly and efficiently synthesize the information.

Hugo Aerts, the director of Harvard’s AI in Medicine initiative, and Raymond H. Mak, a radiation oncologist at DFCI and Brigham and Women’s Hospital, have spearheaded efforts to apply artificial intelligence to the BLCS dataset.

In a recent study published in the Lancet Digital Health, joint last authors Aerts and Mak applied AI algorithms to CT scans of BLCS patients. Their study tested the ability of the algorithms to identify and outline tumors and cancer-affected lymph nodes on the scans. This is a task usually performed by radiation oncologists — physicians who specialize in administering radiation therapy — but it is a labor-intensive process subject to error and inconsistency between physicians.

Before cancerous tissue can be outlined, a CT scanner captures dozens of two-dimensional images of the patient’s lungs. The radiation oncologist then circles the tumors in each of those images.

“The normal workflow is these doctors, radiation oncologists that train for five years, manually draw on each individual axial image. So slice by slice, draw where the tumor is,” Mak says.

With the help of AI, Aerts and Mak saw an opportunity to dramatically reduce the time it takes to complete this task, while potentially improving consistency.

In the study, the researchers asked eight radiation oncologists to edit previously outlined lung CT scans. For the majority of scans, the AI algorithm had outlined the tumors, while doctors had manually outlined tumors on the rest. The scientists found that radiation oncologists worked faster when editing AI-outlined scans than when editing human-outlined scans. Their final outlines were also more consistent with one another when they started from the AI’s outlines.

As Aerts and Mak’s study shows, one of the most powerful applications of AI is in synthesizing data so that doctors can more efficiently diagnose and treat patients using the processed data.

“I think AI can play an enormous role,” Aerts says. “You’re really thinking about better linking all the data together, so that the physician can much more quickly make decisions.”

Aerts and Mak are also applying AI to other questions in lung cancer research. For example, a patient with low muscle mass is more likely to experience toxic side effects from chemotherapy. The researchers are developing algorithms to detect fat and muscle mass on scans in order to predict a patient’s response to the treatment.

Despite the diverse capabilities of AI in medicine, Aerts and Mak acknowledge its limitations. On an “AI fact sheet” for the tumor-outlining algorithm, the researchers warn that the algorithm may fail to capture tumors less than one centimeter long. They also note the lack of racial diversity in the dataset, which may bias the algorithm.

Even so, the patient dataset provides millions of variables spanning genetic and environmental factors. Beyond artificial intelligence, other studies by Christiani and his colleagues have used this data to make breakthroughs in personalized medicine, in which a patient’s personal information is used to optimize treatment.

In 2004, for example, Christiani and his colleagues used the BLCS data to test the efficacy of the lung cancer drug gefitinib for specific subsets of patients.

The results showed that gefitinib inhibited cancer growth by blocking EGFR, a protein on the surface of cells. This study led to the approval of gefitinib for use in cancer patients with certain EGFR mutations. But it would not have been possible without the patients, tumor samples, and blood samples from the BLCS cohort, Christiani says.

Xinan Wang, a postdoctoral researcher working with Christiani at the School of Public Health, has looked at personalized lung cancer treatment under a different variable: smoking history.

Wang took advantage of BLCS’ rich dataset, which ranges from packs per day to years spent smoking, to analyze how smoking history affects an individual’s response to immunotherapy, a treatment approach that trains a patient’s immune system to recognize foreign or mutant proteins on cancer cells.

“Contradictory to common sense, patients who have a longer history of smoking or who’ve been a more intense smoker tend to have better clinical outcomes of immunotherapy,” Wang says. The researchers hypothesize that heavy smoking leads to damaged cells with a higher quantity of mutant proteins. These mutant proteins make it easier to train the body’s immune cells to identify cancer cells.

But in using the BLCS dataset, Wang, Christiani, and their collaborators are limited by the cohort’s lack of ethnic and racial diversity. This poses a challenge for Wang, who is now studying how ethnic ancestry affects a patient’s likelihood of having certain cancer-relevant mutations. Diversifying the dataset could “potentially provide us the opportunity to learn more disparity issues around lung cancer and lung cancer patients’ survival outcomes,” Wang says.

Wang hopes more patients from different backgrounds will join the cohort: “We definitely want to see more patients from the minority to join in the cohort, and we want to see more patients from different parts of the world.”