Earlier Diagnosis of Rare Diseases: The Role of AI and Machine Learning

Within Europe, rare diseases are defined as those that affect fewer than 1 in 2,000 people. At least 9,000 rare diseases have been recognised and, collectively, it is estimated that 1 in 17 people will be affected by a rare disease at some point in their lives.

Although many rare diseases can be disabling or even fatal, the period between initial symptom onset and final diagnosis lasts, on average, 4 years. This period, referred to as the ‘diagnostic odyssey’, often involves patients being shuttled between different medical specialists and subjected to a wide variety of tests.

This delay may stem from the infrequent occurrence of rare diseases, which leads many primary care physicians not to consider them during diagnostic workups. In many cases, the physician has never seen a patient with similar symptoms in their career.

So how can the diagnosis of rare diseases be improved? Many think the answer lies in applying machine learning, a family of artificial intelligence (AI) techniques, to the problem.

Many medical institutions have now adopted Electronic Health Records (EHRs) to record patients’ medical histories, making real-world data more accessible. This creates an abundance of opportunities for secondary use. Working with medical experts, researchers can use this data to develop predictive models that identify patients with rare diseases before significant disease progression, thus shortening the diagnostic odyssey and improving patient outcomes.

The first step in developing such a predictive model is to identify the cohort used to train the model and extract the EHRs for those patients. This cohort will consist of confirmed positive cases and a set of control patients for whom the target disease has been ruled out. Ideally, the control patients will have similar characteristics to the positive cases. As rare disease patients are likely to use healthcare facilities more often, ensuring a comparable control sample will enable the predictive model to identify the EHR footprint of a rare disease patient, rather than merely distinguish between patients with high and low healthcare utilisation.
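As an illustration of choosing comparable controls, the sketch below pairs each confirmed case with the closest unused control by healthcare utilisation and age. It is a minimal example only: the `Patient` fields, the `match_controls` helper and the gap limits are hypothetical, and a real study would likely match on more characteristics and use an established matching method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    patient_id: str       # hypothetical identifier
    age: int
    visits_per_year: int  # simple proxy for healthcare utilisation


def match_controls(cases, candidates, max_visit_gap=2, max_age_gap=5):
    """Greedy 1:1 matching: pair each case with the closest unused control."""
    available = list(candidates)
    pairs = []
    for case in cases:
        # Keep only controls that are comparable on both characteristics.
        eligible = [
            c for c in available
            if abs(c.visits_per_year - case.visits_per_year) <= max_visit_gap
            and abs(c.age - case.age) <= max_age_gap
        ]
        if not eligible:
            continue  # no comparable control; the case stays unmatched
        best = min(
            eligible,
            key=lambda c: (abs(c.visits_per_year - case.visits_per_year),
                           abs(c.age - case.age)),
        )
        available.remove(best)  # each control is used at most once
        pairs.append((case, best))
    return pairs


# Made-up example data.
cases = [Patient("case-1", 34, 12), Patient("case-2", 58, 9)]
candidates = [
    Patient("ctrl-1", 35, 2),
    Patient("ctrl-2", 36, 11),
    Patient("ctrl-3", 60, 10),
    Patient("ctrl-4", 57, 1),
]
pairs = match_controls(cases, candidates)
matched_ids = [(case.patient_id, control.patient_id) for case, control in pairs]
print(matched_ids)
```

Here the low-utilisation candidates are never selected, so the model cannot separate cases from controls on visit frequency alone.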

Data fields from the EHRs can then be parsed into features for the predictive model to learn from. These features can come from structured fields, such as lab test results and demographics, or from unstructured fields like doctors’ notes, which can be processed using Natural Language Processing (NLP) techniques. These medical records are likely to contain hundreds or thousands of candidate features, which can be narrowed down to the most important through feature selection techniques.
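A toy version of this step might look like the following. Everything here is an assumption made for illustration: the record fields, the CRP cut-off and the symptom vocabulary are invented, plain keyword matching stands in for a proper clinical NLP pipeline (it ignores negation, for example), and ranking by difference in prevalence is just one simple feature selection approach among many.

```python
import re

# Hypothetical symptom vocabulary; a real project would rely on a clinical
# NLP pipeline that handles negation, abbreviations and synonyms.
SYMPTOM_TERMS = ["fatigue", "joint pain", "rash"]


def extract_features(record):
    """Turn one EHR record (a dict) into a flat dict of binary features."""
    features = {
        "age_over_50": int(record["age"] > 50),
        "crp_elevated": int(record["crp_mg_per_l"] > 10.0),  # illustrative cut-off
    }
    note = record["note"].lower()
    for term in SYMPTOM_TERMS:
        pattern = r"\b" + re.escape(term) + r"\b"
        name = "note_" + term.replace(" ", "_")
        features[name] = int(bool(re.search(pattern, note)))
    return features


def rank_features(rows, labels):
    """Rank features by absolute difference in prevalence, cases vs controls."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    scores = {}
    for name in rows[0]:
        pos_rate = sum(r[name] for r, y in zip(rows, labels) if y == 1) / n_pos
        neg_rate = sum(r[name] for r, y in zip(rows, labels) if y == 0) / n_neg
        scores[name] = abs(pos_rate - neg_rate)
    return sorted(scores, key=scores.get, reverse=True)


# Made-up records: label 1 = confirmed case, 0 = control.
records = [
    {"age": 34, "crp_mg_per_l": 22.0, "note": "Reports fatigue and joint pain."},
    {"age": 58, "crp_mg_per_l": 15.5, "note": "Persistent joint pain, new rash."},
    {"age": 61, "crp_mg_per_l": 3.0, "note": "Fatigue after travel."},
    {"age": 40, "crp_mg_per_l": 4.2, "note": "Routine check, no complaints."},
]
labels = [1, 1, 0, 0]

rows = [extract_features(r) for r in records]
top_features = rank_features(rows, labels)[:3]
print(top_features)
```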

Machine learning models can then be applied to the data and validated, aiming for the highest possible true positive rate while keeping false positives at an acceptable level. The model can then generate a risk score for each patient, providing a means of stratifying patients. For example, control patients with a risk score higher than the minimum score among positive patients may benefit from further diagnostic testing.
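The stratification idea can be sketched in a few lines. The risk scores below are made up and are assumed to come from an already trained model; `rates_at_threshold` and `flag_controls` are hypothetical helper names, not part of any particular platform.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (true positive rate, false positive rate) at a score cut-off."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg


def flag_controls(patient_ids, scores, labels):
    """List controls scoring above the lowest-scoring confirmed case."""
    min_positive = min(s for s, y in zip(scores, labels) if y == 1)
    return [
        pid for pid, s, y in zip(patient_ids, scores, labels)
        if y == 0 and s > min_positive
    ]


# Made-up scores: label 1 = confirmed case, 0 = control.
patient_ids = ["p1", "p2", "p3", "p4", "p5", "p6"]
labels = [1, 1, 1, 0, 0, 0]
scores = [0.91, 0.78, 0.42, 0.55, 0.10, 0.05]

tpr, fpr = rates_at_threshold(scores, labels, threshold=0.5)
flagged = flag_controls(patient_ids, scores, labels)
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}, flagged controls={flagged}")
```

Reporting the false positive rate alongside the true positive rate matters because a model that flags every patient would achieve a perfect true positive rate while being useless in practice.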

Advancing drug development using predictive analytics

The benefits of artificial intelligence for earlier patient diagnosis do not stop there. Between 1983 and 2020, 599 orphan products to treat rare diseases came onto the market; however, only 10% of these products have three or more orphan indications. This indicates that the majority of treatments target relatively few diseases, and thus small numbers of patients.

One of the greatest challenges that life science companies face when developing orphan drugs is the difficulty of identifying and treating small patient populations, which results in studies that lack statistical power. Algorithms that make better use of the full range of biomedical data to identify patient populations can help organisations in the drug development process by easing recruitment bottlenecks. They can also provide organisations with the data to drive business decisions, such as whether enough patients are available to support the initial investment.


Although machine learning in the context of rare diseases offers significant benefits, the quality of the output is only as good as the input. Because the rare disease population is dispersed and diagnosis is slow, the amount of information available on rare disease patients is limited. It is therefore imperative that future work in the field takes a coordinated international approach that shares data across geographic borders, increasing the volume of information that machine learning algorithms can train on.

Because rare diseases often result in life-threatening or chronically debilitating conditions, the time between symptom onset and treatment is critical. Platforms that can harness the power of predictive analytics, such as RwHealth’s Data Science Platform (DSP), will play an important role in the future of diagnosing rare diseases. The ability of the DSP to leverage quality datasets and its application of advanced machine learning algorithms will help both patients and organisations benefit from a timely and accurate diagnosis.

To learn more about the DSP or to request a demo, email info@realworld.health

Press enquiries

If you are a journalist looking for expert comment for news or features – or to arrange interviews, filming and photography – please email muna@realworld.health