Enormous amounts of data are generated during clinical interactions across multiple healthcare settings in the form of structured and unstructured electronic health records (EHRs). The data contains rich, longitudinal information on diagnoses, symptoms, medications and tests, which can be used for research. However, EHR data is not primarily generated for research purposes: it is stored in disparate sources, often in different formats, and requires a significant amount of pre-processing.

Our Phenotype Library

The UK has established a Phenotype Library. It is the only national, wholly open-access library of reproducible phenotyping algorithms for defining human disease, lifestyle risk factors and biomarkers using diverse electronic health records. For each phenotype, the library curates its metadata, implementation details, programmatic code and validation information. The Library enables reproducible and transparent research using such complex data by the wider research and clinical community.

Researchers hoping to unlock the valuable data contained within EHRs need to spend considerable time creating the coding needed to work with data that often contains inconsistencies and is of varying quality and detail.

The HDR UK Phenotype Library has been created to assist researchers working with EHRs by providing an open-access national library of validated phenotyping algorithms, definitions and methods. Routine use of the library by researchers will cut down on duplication of effort by allowing re-use of algorithms, tools and methods, and will ensure reproducibility of research by creating a national standard for creating, evaluating and representing phenotypes.

Are you a researcher who has developed a phenotyping algorithm that:

  • defines a disease, risk factor or biomarker,
  • derives information from one or more EHR sources,
  • is associated with one or more peer-reviewed outputs and
  • is already validated?

You can contribute to the improvement of health by depositing your algorithms in the Phenotype Library, enabling their dissemination, re-use, evaluation and citation to the benefit of the emerging phenomics research community.

The phenome national priority is developing tools which will support the definition and creation of computable phenotypes, which can be used to interrogate EHR data to enable health research for patient benefit.

  • Phenoflow is a phenotype definition model which can be used to define phenotypes from EHRs and export them. This allows phenotype definitions to be re-used across research institutes, improving reproducibility. Over 300 phenotypes are currently downloadable from Phenoflow and can be used immediately to interrogate local datasets. Phenoflow also allows researchers to author new phenotypes and enables their validation against multiple data sources.
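
To make the idea of a computable phenotype concrete, the following is a minimal sketch in Python. It assumes a phenotype can be represented as a named set of clinical codes that is matched against coded EHR events. The `Phenotype` class, the `find_cases` function, the record layout and the code list are hypothetical illustrations only; they are not Phenoflow's definition model and not an entry from the Phenotype Library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phenotype:
    """Hypothetical computable phenotype: a name plus a set of clinical codes."""
    name: str
    codes: frozenset


def find_cases(phenotype, events):
    """Return sorted IDs of patients with at least one matching coded event.

    `events` is an iterable of (patient_id, code) pairs (assumed layout).
    """
    return sorted({pid for pid, code in events if code in phenotype.codes})


# Illustrative code list only; real phenotypes are curated and validated.
asthma = Phenotype(
    name="asthma (illustrative)",
    codes=frozenset({"H33..", "J45"}),  # a Read code and an ICD-10 code
)

# Made-up events for demonstration purposes.
events = [
    ("p1", "H33.."),
    ("p2", "E11"),
    ("p3", "J45"),
    ("p1", "J45"),
]

cases = find_cases(asthma, events)
print(cases)
```

Because the definition is plain data, the same `Phenotype` object could in principle be shared between institutes and run against each local dataset, which is the kind of re-use described above.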

National Medical Text Analytics

The HDR UK Text Analytics Resource is the UK’s first repository of tools, methods and datasets for natural language processing (NLP) of the unstructured free text contained within electronic health records. The resource will help the clinical and research community to unlock the rich data contained within electronic health records to deliver improvements in healthcare.

There is much value in the information included in EHRs, e.g. symptoms, tests, investigations, diagnoses and treatments, which could help researchers and clinicians learn how to tailor treatments more accurately for individual patients and to offer better and safer healthcare. However, most of the information held within these records is in written form (sometimes referred to as unstructured text), which is difficult to use and is currently under-used for research.

To access the data held within unstructured text, we need to develop specialised computerised tools to process these words, so that we have a full picture of all patient symptoms, experiences and diagnoses to use in research for patient benefit. The HDR UK Text Analytics Resource is building an NLP research community that will address the complexity of clinical text through the development of shared tools and standards.

A curated list of applications and datasets for healthcare text analytics can be found on HDR UK Text’s GitHub “resources” repository. Some examples are given below:

  • CogStack allows the extraction of information from unstructured data (e.g. PDF/MS Word documents, images) contained within Electronic Health Records (EHRs). This data, which is usually inaccessible, can be analysed in multiple ways once it has been extracted and processed via CogStack.

  • MedCAT is a natural language processing tool which can be used to link the extracted EHR data to definitions of disease to answer research questions such as ‘the relationship between diseases and age?’ Over twelve million free text documents and over 250 million diagnostic results and reports have been processed within CogStack, which is being implemented across three NHS Foundation Trusts (South London and Maudsley, King’s College Hospital, and University College London Hospitals). CogStack was cited in the Secretary of State for Health and Social Care’s speech ‘Better tech: not a ‘nice to have’ but vital to have for the NHS’ (January 2020) and NHSX’s report ‘Artificial Intelligence: How to get it right’ (October 2019).

  • FMA allows the extraction of information, including causes of death and other diagnoses, from free text in EHRs. The algorithm makes use of Read Clinical Codes, whereby clinical terms are assigned a code (e.g. ‘Asthma’ = ‘H33..’), and the earlier iteration, OXMIS (OXford Medical Information System) codes, to identify ‘medical’ words within the text. FMA facilitates research using free text in EHRs (e.g. those deposited in the UK General Practice Research Database), reducing the need for manual analysis.
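
The idea of using a coded dictionary to spot ‘medical’ words in free text can be sketched in a few lines of Python. This is a deliberately simplified illustration and is not the FMA algorithm itself: the `match_terms` function and the `TERM_CODES` dictionary are hypothetical, only the ‘Asthma’ = ‘H33..’ pair comes from the text above, and the second entry uses a placeholder rather than a real Read code.

```python
import re

# Hypothetical term-to-code dictionary for illustration.
TERM_CODES = {
    "asthma": "H33..",   # pair quoted in the text above
    "wheeze": "DEMO1",   # placeholder, NOT a real Read code
}


def match_terms(text, term_codes):
    """Return (term, code) pairs for dictionary terms found as whole words."""
    found = []
    for term, code in term_codes.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append((term, code))
    return found


# Made-up clinical note for demonstration purposes.
note = "Patient has a history of Asthma. No wheeze today."
matches = match_terms(note, TERM_CODES)
print(matches)
```

Note that this sketch also reports ‘wheeze’ even though the note says ‘No wheeze’. Handling negation, misspellings and context is part of what makes clinical text difficult, and is why shared, validated tools are valuable.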

Use our NLP resources, applications and datasets.

Smartphones and wearable devices

The mHealth toolbox will enable researchers to rapidly spin up population-level remote monitoring studies with data streams including active data (e.g. questionnaires and clinical assessments) as well as passively generated data from smartphones and wearable devices, linked to other data modalities such as EHRs. Using reproducible methods to analyse mHealth-generated data, researchers will be able to better understand the causes and consequences of disease.

Our mHealth community is developing open access tools and software which will support researchers undertaking studies using health data collected via smartphones and wearables, for example:

  • RADAR-base is a remote data collection platform that enables health data collected from study participants via wearables and mobile technologies to be shared with and used by clinicians and researchers. The platform supports study design and set-up, active (e.g. the use of questionnaires) and passive (e.g. real-time monitoring of movement) remote data collection, and secure data transmission to the research/clinical team.

  • BiobankAccelerometerAnalysis is a tool to extract health information from large accelerometer datasets (usually captured via a wrist-worn device that measures acceleration, i.e. a person’s activity). The software generates time-series and summary metrics useful for answering key questions such as how much time is spent in sleep, sedentary behaviour or physical activity, and what the health consequences are.
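
As a rough illustration of what a summary metric derived from an accelerometer time series might look like, the Python sketch below totals the minutes spent in coarse activity categories. It is not the method used by BiobankAccelerometerAnalysis: the `summarise_activity` function, the 30-second epoch length and the threshold values are all assumptions chosen for the example, and sleep detection is left out entirely.

```python
EPOCH_SECONDS = 30  # assumed epoch length for this example


def summarise_activity(epochs_mg, sedentary_below=30.0, active_from=100.0):
    """Total minutes per category from per-epoch mean acceleration (milli-g).

    The thresholds are hypothetical and not taken from any validated tool.
    """
    minutes = {"sedentary": 0.0, "light": 0.0, "active": 0.0}
    for value in epochs_mg:
        if value < sedentary_below:
            label = "sedentary"
        elif value < active_from:
            label = "light"
        else:
            label = "active"
        minutes[label] += EPOCH_SECONDS / 60
    return minutes


# Made-up per-epoch values for demonstration purposes.
epochs = [5.0, 12.0, 45.0, 150.0, 220.0, 8.0]
summary = summarise_activity(epochs)
print(summary)
```

Real analyses typically rely on validated models rather than fixed cut-offs, but the overall shape, a time series in and a handful of summary numbers out, is the same.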

Get involved

To find out more and to get involved, contact Serina Hayes, Phenomics Programme Director; Spiros Denaxas, Phenomics National Resource Lead; or Richard Dobson or Angus Roberts, Text Analytics National Resource Co-Leads.