Resources


Database Credentialed Access

MIMIC-IV-Ext-GPT-3_5-Generated-Discharge-Summaries-for-Low-Resource-Codes

Matúš Falis, Aryo Pradipta Gema, Hang Dong, Luke Daines, Siddharth Basetti, Michael Holder, Rose Penfold, Alexandra Birch, Beatrice Alex

9,606 Synthetic Discharge Summaries generated by GPT-3.5 based on combinations of ICD-10-code descriptions associated with real discharge summaries in MIMIC-IV. Focus on low resource codes.

icd coding large language model data augmentation

Published: Dec. 16, 2024. Version: 1.0.0


Database Open Access

Synthetic Mention Corpora for Disease Entity Recognition and Normalization

Kuleen Sasse, John David Osborne

We present the Synthetic Mention Corpora for Disease Entity Recognition and Normalization, containing 128000 disease mentions from the UMLS disorder group, generated by an LLM. This corpus aims to improve these tasks in biomedical and clinical texts.

nlp named entity recognition machine learning data augmentation entity normalization

Published: Feb. 3, 2025. Version: 1.0.0


Database Credentialed Access

MIMIC-IV-Ext-GPT-3_5-Generated-Discharge-Summaries-for-Low-Resource-Codes

Matúš Falis, Aryo Pradipta Gema, Hang Dong, Luke Daines, Siddharth Basetti, Michael Holder, Rose Penfold, Alexandra Birch, Beatrice Alex

9,606 Synthetic Discharge Summaries generated by GPT-3.5 based on combinations of ICD-10-code descriptions associated with real discharge summaries in MIMIC-IV. Focus on low resource codes.

icd coding large language model data augmentation

Published: Dec. 16, 2024. Version: 1.0.0


Database Restricted Access

TAME Pain: Trustworthy AssessMEnt of Pain from Speech and Audio for the Empowerment of Patients

Tu-Quyen Dao, Eike Schneiders, Jennifer Williams, John Robert Bautista, Tina Seabrooke, Ganesh Vigneswaran, Rishik Kolpekwar, Ritwik Vashistha, Arya Farahi

TAME Pain is a dataset that captures acoustic signals of pain and is augmented by annotating every sentence the participant speaks with details such as background and foreground noise, speech errors, and non-speech vocal features.

audio speech pain cold pressor task

Published: Jan. 21, 2025. Version: 1.0.0


Database Credentialed Access

EHR-DS-QA: A Synthetic QA Dataset Derived from Medical Discharge Summaries for Enhanced Medical Information Retrieval Systems

Konstantin Kotschenreuther

Dataset consisting of question and answer pairs synthetically generated from medical discharge summaries, designed to facilitate the training and development of large language models specifically tailored for healthcare applications

mimic-iv clinical question-answering medical discharge summaries large language models

Published: Jan. 11, 2024. Version: 1.0.0


Database Open Access

Synthetic Mention Corpora for Disease Entity Recognition and Normalization

Kuleen Sasse, John David Osborne

We present the Synthetic Mention Corpora for Disease Entity Recognition and Normalization, containing 128000 disease mentions from the UMLS disorder group, generated by an LLM. This corpus aims to improve these tasks in biomedical and clinical texts.

nlp named entity recognition machine learning data augmentation entity normalization

Published: Feb. 3, 2025. Version: 1.0.0