Data Science in the Battle against Infectious Diseases

Using the Latest Advances in Data Science to Fight Infectious Diseases

By Payam Etminani, CEO, Bitscopic

The past few years have seen dramatic technology-enabled transformations in many areas of everyday life, including transportation (Uber), accommodation (Airbnb), shopping (Amazon), and communication (Facebook). Another area where dramatic advances have been taking place is the use of information technology to counter an enemy that has attacked human populations since long before recorded history: infectious diseases.

Every day, as we go about our lives, thousands of specially trained epidemiologists at the federal, state, and local levels keep us safe by watching for emerging infectious diseases, making sure that potential outbreaks are spotted and acted on as early as possible. A recent example is the effective public health response to the Ebola epidemic in West Africa, which saved countless lives and prevented the epidemic from spreading further.

One of the most dramatic recent shifts empowering epidemiologists to be more effective at their jobs comes from improvements in data technologies. In the past, the old “relational” data model dictated that data had to be highly structured and, as a result, kept in distinct silos. This made it difficult, if not impossible, to analyze data from multiple sources to find correlations. Epidemiologists would wait many minutes or even hours for the results of each query they ran, which is unacceptable when you need to test dozens of hypotheses to understand and contain a fast-moving outbreak. (Imagine how you would feel if each one of your Google searches took 45 minutes to return!) By contrast, using newer technologies, the same queries on the same hardware can run in seconds.

Our company, Bitscopic, has been working with epidemiologists at the U.S. Department of Veterans Affairs (VA) to combine and utilize structured, semi-structured, and unstructured data to dramatically improve their work. The system can easily be scaled to handle growing volumes of data without bogging down, and new data sources can be added in hours or days, instead of months as was previously the case. A press release detailing our contract with the VA and a technical white paper about Bitscopic’s Praedico platform in use at the VA are also available.

Preventing the Spread of the Zika Virus

Recently, we worked with VA epidemiologists to improve the process of identifying suspected and confirmed cases of the Zika virus. These infections are difficult to confirm retrospectively because the diagnosis is largely based on symptoms and the person’s recent history (e.g., fever and rash with a history of travel to an area where the Zika virus occurs). Laboratory tests may be difficult to interpret since there may be cross-reactivity with other circulating flaviviruses such as dengue, West Nile, and yellow fever. The timing of testing is also important, because the preferred test for Zika (PCR assay) only identifies the virus during the first 5-7 days of illness. Because Zika is so new, there are no specific ICD-10 diagnosis codes for providers to use when coding encounters in which Zika virus infection was suspected or confirmed. Furthermore, the virus often causes only non-specific symptoms, and frequently none at all.

The VA’s approach was to combine data from multiple sources using the Praedico platform to create a single unified view of the burden of Zika in the VA patient population. By looking at a combination of related ICD-10 diagnosis codes, clinical lab reports, hospitalization records, travel advisories, and social media feeds, epidemiologists were able to run a number of complex search queries across multiple data sets. This enabled them to cast a wider net to rapidly identify people at risk of Zika infections and to gain a better sense of the spectrum of the disease. For example, our tool can be used to help identify pregnant Veterans who may require additional testing or follow-up in areas with active Zika virus transmission (such as Puerto Rico, the U.S. Virgin Islands and American Samoa).
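To make the idea of combining sources concrete, here is a minimal sketch in Python of how symptom codes, travel history, and lab results might be joined to separate confirmed from suspected cases. It is not the Praedico implementation. The records, field names, the classify_patients function, and the 14-day travel window are all hypothetical choices made for illustration; the two ICD-10 codes stand in for the "fever and rash" pattern described above.

```python
from datetime import date

# Illustrative assumptions, not the Praedico schema or VA case definitions.
SYMPTOM_CODES = {"R50.9", "R21"}  # ICD-10: fever, unspecified; rash
ZIKA_AREAS = {"Puerto Rico", "U.S. Virgin Islands", "American Samoa"}

# Made-up records standing in for three separate data sources.
encounters = [
    {"patient_id": "P1", "date": date(2016, 5, 2), "icd10": {"R50.9", "R21"}},
    {"patient_id": "P2", "date": date(2016, 5, 3), "icd10": {"R50.9"}},
    {"patient_id": "P3", "date": date(2016, 5, 4), "icd10": {"R50.9", "R21"}},
    {"patient_id": "P4", "date": date(2016, 5, 5), "icd10": {"R50.9"}},
]
travel = [
    {"patient_id": "P1", "area": "Puerto Rico", "returned": date(2016, 4, 28)},
    {"patient_id": "P2", "area": "Puerto Rico", "returned": date(2016, 4, 29)},
    {"patient_id": "P3", "area": "Canada", "returned": date(2016, 4, 30)},
]
labs = [
    {"patient_id": "P4", "test": "Zika PCR", "result": "positive"},
]


def classify_patients(encounters, travel, labs, travel_window_days=14):
    """Return {patient_id: "confirmed" | "suspected"} for flagged patients."""
    # A positive PCR result confirms a case regardless of coded symptoms.
    confirmed = {
        r["patient_id"]
        for r in labs
        if r["test"] == "Zika PCR" and r["result"] == "positive"
    }
    status = {}
    for enc in encounters:
        pid = enc["patient_id"]
        if pid in confirmed:
            status[pid] = "confirmed"
            continue
        # Suspected: both symptom codes present AND recent travel to a listed area.
        has_symptoms = SYMPTOM_CODES <= enc["icd10"]
        recent_travel = any(
            t["patient_id"] == pid
            and t["area"] in ZIKA_AREAS
            and 0 <= (enc["date"] - t["returned"]).days <= travel_window_days
            for t in travel
        )
        if has_symptoms and recent_travel:
            status.setdefault(pid, "suspected")
    return status


result = classify_patients(encounters, travel, labs)
print(result)
```

In this toy data set only P1 (symptoms plus recent travel) and P4 (positive lab result) are flagged. A real case definition would be set by the epidemiologists and would be considerably richer than this rule.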

Another area where we have seen data dramatically assist epidemiologists is in detecting patient infections caused by contaminated medical devices, such as an improperly cleaned endoscope. This requires a look-back investigation in which the epidemiologist examines events from an earlier time period and analyzes multiple data sets, including patient medical records, clinical lab reports, databases of identification numbers barcoded on each device, surgical data, and CPT procedure codes. By combining all these data sources, the investigator can identify the devices involved, determine which patients were exposed and potentially infected, and take corrective action.
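A look-back of this kind can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the device identifiers, patient records, field names, the look_back function, and the 30-day follow-up window are all hypothetical, and the logic is a simple join between a procedure log and culture results rather than an actual investigation protocol.

```python
from datetime import date

# Made-up procedure log: which barcoded device was used on which patient.
procedures = [
    {"patient_id": "P1", "device_id": "SCOPE-001", "cpt": "43235", "date": date(2016, 3, 1)},
    {"patient_id": "P2", "device_id": "SCOPE-001", "cpt": "43235", "date": date(2016, 3, 9)},
    {"patient_id": "P3", "device_id": "SCOPE-002", "cpt": "45378", "date": date(2016, 3, 9)},
    {"patient_id": "P4", "device_id": "SCOPE-001", "cpt": "45378", "date": date(2016, 4, 20)},
]
# Made-up clinical lab culture results.
cultures = [
    {"patient_id": "P1", "date": date(2016, 2, 20), "result": "positive"},
    {"patient_id": "P2", "date": date(2016, 3, 15), "result": "positive"},
    {"patient_id": "P3", "date": date(2016, 3, 16), "result": "positive"},
]


def look_back(device_id, start, end, procedures, cultures, follow_up_days=30):
    """Return (exposed patients, exposed patients with a later positive culture)."""
    # Step 1: every patient on whom the device was used during the window,
    # keeping the earliest exposure date for each patient.
    exposed = {}
    for p in procedures:
        if p["device_id"] == device_id and start <= p["date"] <= end:
            first = exposed.get(p["patient_id"])
            if first is None or p["date"] < first:
                exposed[p["patient_id"]] = p["date"]

    # Step 2: exposed patients with a positive culture after the exposure,
    # within the (assumed) follow-up period.
    possibly_infected = set()
    for c in cultures:
        exposure_date = exposed.get(c["patient_id"])
        if exposure_date is None or c["result"] != "positive":
            continue
        if 0 <= (c["date"] - exposure_date).days <= follow_up_days:
            possibly_infected.add(c["patient_id"])

    return sorted(exposed), sorted(possibly_infected)


exposed, possibly_infected = look_back(
    "SCOPE-001", date(2016, 3, 1), date(2016, 3, 31), procedures, cultures
)
print(exposed, possibly_infected)
```

Here P1 and P2 were exposed during the window, but only P2 has a positive culture after the exposure; P1's positive culture predates the procedure, so it is not attributed to the device.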

New Possibilities

As exciting as the progress described above is, we are only at the beginning of the public health data revolution. The first phase has been mastering the art of bringing data together and integrating it meaningfully, with the guidance of subject matter experts to make sure the context is correct. The next step is to search for the “known unknowns,” setting the system algorithms to automatically be on the lookout for specific trends or parameters that are indicative of a particular outbreak or other anticipated occurrences. Ultimately, as the machine learning evolves, we hope it will become good at prediction. For example, the system might detect a Norovirus outbreak in a specific area. Potentially, it could alert hospitals or healthcare providers in the affected localities and advise them to order Norovirus-specific testing for any patients they may see with diarrhea or related symptoms. Without this type of alerting and notification, providers may not perform this specific test. With early alerting of outbreaks, patients can potentially be treated earlier and more effectively, and the spread of the disease can be contained so that fewer people are infected.
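One simple way to put a system "on the lookout" for a known pattern is a threshold rule: compare the latest case count in each area with that area's recent baseline and raise an alert when it is unusually high. The sketch below shows that idea in Python. It is a deliberately basic illustration, not the algorithm used in any production system; the county names, counts, the outbreak_alerts function, the eight-week baseline, and the threshold of three standard deviations are all assumptions.

```python
from statistics import mean, stdev

# Made-up weekly counts of Norovirus-like illness per area (oldest first).
weekly_counts = {
    "County A": [3, 4, 2, 5, 3, 4, 3, 4, 15],
    "County B": [6, 5, 7, 6, 5, 6, 7, 6, 7],
}


def outbreak_alerts(weekly_counts, baseline_weeks=8, z_threshold=3.0):
    """Return areas whose latest week exceeds baseline mean + z * std dev."""
    alerts = []
    for area, counts in weekly_counts.items():
        current = counts[-1]
        baseline = counts[-baseline_weeks - 1:-1]
        if len(baseline) < 2:
            continue  # not enough history to estimate variability
        mu = mean(baseline)
        # Floor the spread at 1.0 so a very flat baseline does not
        # trigger alerts on tiny fluctuations (an assumed design choice).
        sigma = max(stdev(baseline), 1.0)
        if current > mu + z_threshold * sigma:
            alerts.append(area)
    return sorted(alerts)


alerts = outbreak_alerts(weekly_counts)
print(alerts)
```

With these numbers only County A is flagged, since its latest week is far above its baseline while County B stays within its normal range. The alert list could then drive the kind of provider notification described above.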

Bitscopic is excited to be part of the 2016 Council of State and Territorial Epidemiologists (CSTE) Annual Conference in Anchorage, Alaska. We will be exploring these issues further in a panel discussion entitled “Mastering the Data Haystack: Unifying Data Across Silos for New Insights” on Tuesday, June 21st at 7:30 AM. If you will be at the conference, we hope to meet you there. For those who can’t be there but are interested in exploring these issues further, stay tuned as we will have material from the panel posted later.