ABSTRACT

The UK COVID-19 Vocal Audio Dataset is designed for the training and evaluation of machine learning models that classify SARS-CoV-2 infection status or associated respiratory
symptoms using vocal audio. The UK Health Security Agency recruited voluntary participants through the national Test and Trace programme and the REACT-1 survey in England from March 2021 to
March 2022, during dominant transmission of the Alpha and Delta SARS-CoV-2 variants and some Omicron variant sublineages. Audio recordings of volitional coughs, exhalations, and speech were
collected in the ‘Speak up and help beat coronavirus’ digital survey alongside demographic, symptom and self-reported respiratory condition data. Digital survey submissions were linked to
SARS-CoV-2 test results. The UK COVID-19 Vocal Audio Dataset represents the largest collection of SARS-CoV-2 PCR-referenced audio recordings to date. PCR results were linked to 70,565 of
72,999 participants and 24,105 of 25,706 positive cases. Respiratory symptoms were reported by 45.6% of participants. This dataset has additional potential uses for bioacoustics research, with 11.3% of participants self-reporting asthma and 27.2% with linked influenza PCR test results.

BACKGROUND & SUMMARY

The scale and impact of the COVID-19 pandemic have created a need for rapid and affordable point-of-care diagnostics and screening tools for infection monitoring. The possibility of
accurate and generalisable detection of COVID-19 from voice and respiratory sounds using audio classification on a smart device has been hypothesised as a way to provide a non-invasive,
affordable and scalable option for COVID-19 screening for both personal and public health monitoring1. However, prior machine learning studies to determine the feasibility of COVID-19
detection from audio have largely relied on datasets which are too small or unrepresentative to produce a generalisable model, or which include self-reported COVID-19 status, rather than
gold standard PCR (Polymerase Chain Reaction) testing for SARS-CoV-2 infection (see Table 1). These datasets have a relatively small proportion of positive cases, and include inadequate
metadata for statistical evaluation. They largely do not enable studies using them to meet diagnostic reporting criteria (for example the STARD 2015 (ref. 2) and forthcoming STARD-AI3 criteria), such as reporting the interval between reference test and recording, random sampling, or avoiding case-control designs in which positives and negatives are sourced from different recruitment channels.
Following the publication of initial studies reporting accurate classification of SARS-CoV-2 infection from vocal and respiratory audio4,5, the UK Health Security Agency (UKHSA, formerly
NHS Test and Trace, the Joint Biosecurity Centre, and Public Health England) was commissioned to collect a dataset to allow for the independent evaluation of these studies. Dataset analysis
was carried out by The Alan Turing Institute and Royal Statistical Society (Turing-RSS) Health Data Lab (https://www.turing.ac.uk/research/research-projects/turing-rss-health-data-lab). A
dataset larger than the majority of existing datasets was needed to provide sufficient instances of various recording environments and mobile devices (information which was not collected),
and to provide sufficient instances for the thousands of features or representations typically produced from short vocal audio samples6. Such a dataset also needed to be sufficiently large
and diverse to validate model performance across various participant demographic groups and presentations of SARS-CoV-2 infection. UKHSA developed an online survey to collect a novel
SARS-CoV-2 bioacoustics dataset (Fig. 1a,b) in England from 2021-03-01 to 2022-03-07 (Fig. 1d), during dominant transmission of the Alpha and Delta SARS-CoV-2 variants and some Omicron
variant sublineages7. Participants were recruited after undergoing testing for SARS-CoV-2 infection as part of the national “_Real-time Assessment of Community Transmission”_ (REACT-1)
surveillance study (https://www.imperial.ac.uk/medicine/research-and-impact/groups/react-study/studies/the-react-1-programme/) and the NHS Test and Trace (T&T) symptomatic testing
programme in the community (known as Pillar 2)8. To facilitate independent validation of existing models, audio samples common across existing studies were collected in the online survey,
including: volitional (forced) cough, an exhalation sound, and speech. These were linked to SARS-CoV-2 testing data (method, results, date) for the test undertaken by the participant either
as part of REACT-1 or T&T. Further data on participant demographics (age, gender, ethnicity, first language, location) and symptoms (type and date of onset) were collected in the online
survey to monitor potential bias. The UK COVID-19 Vocal Audio Dataset9 is designed for studies examining the possibility of classification of SARS-CoV-2 infection from vocal audio, including
for the training and evaluation of machine learning models using PCR as a gold-standard reference test10. The inclusion of influenza status (for REACT-1 participants in REACT rounds 16–18)
and symptom and respiratory condition metadata may provide additional uses for bioacoustics research. All summary statistics described in this manuscript reflect the open access version of
the UK COVID-19 Vocal Audio Dataset9 unless otherwise stated. Differences between the protected and open access dataset are described in Methods - Data Anonymisation. The protected version
of the UK COVID-19 Vocal Audio Dataset is fully documented in our pre-print data descriptor11.

METHODS

SURVEY DESIGN

Survey questions and responses are listed in Supplementary Table S1.
Survey questions were designed to align to existing vocal acoustic data collection studies (see Table 1) and prevalence studies12, so that future comparisons of study demographics could be
made if necessary. These include variables that could be captured in vocal audio acoustic features and/or could be confounded with SARS-CoV-2 infection status13,14, for example, respiratory
symptoms, smoker status, and respiratory health conditions. The participant’s testing provider collected data on age, gender, ethnicity, geographical area and SARS-CoV-2 test result (and
associated information such as test type and PCR cycle threshold information, if available). To minimise data entry fatigue, these were linked to survey responses and not collected again
through the survey. Survey variables were also chosen to align with existing government surveys for ease of comparison: options available for ‘first language’ reflected those available in
the ONS 2011 Census15; symptom options combined those available in the ONS Coronavirus Infection Survey12 and the NHS Test and Trace symptom self-screening tool prior to April 2022
(https://www.nhs.uk/conditions/covid-19/covid-19-symptoms-and-what-to-do/). Additional symptom options were added 2021-07-21 (‘other symptoms new to you in the last 2 weeks’) and 2021-08-11
(‘runny or blocked nose’, ‘sore throat’) to capture symptoms reported at a higher frequency in COVID-19 variants circulating at the time16. All questions allowed a ‘prefer not to say’ option
to maintain participant control on the data they chose to share and to minimise non-response bias. The final survey questions requested participants to record short audio segments using the
microphone on their device, where the user interface for making the recordings was embedded in the online survey. Audio recordings, in order of participant submission, were: a sentence read
aloud, three successive _“ha”_ exhalation sounds, one volitional cough, and three successive volitional coughs. Audio prompts were chosen to be similar to those of existing datasets (see
Table 1), so that models trained on other datasets could be independently evaluated with this dataset. On completion of the survey, responses including audio data were sent to a secure
server and temporarily held before being sent to UKHSA. Screenshots of the survey are shown in Fig. 1b. Two cough recordings were captured of one and three successive coughs, matching the
prompts for cough recordings captured in previous studies (see Table 1). A cough is an innate reflex that removes irritation from the respiratory tract in order to enhance gas exchange. Coughs are typically associated with respiratory infection, and a new, persistent cough was one of three ‘classic’ COVID-19 symptoms; however, it was less prominent than other respiratory symptoms in later variants16. The difference between a reflexive and a volitional cough should be noted: a volitional cough may differ in duration and power17, and may be affected by the participant’s surroundings and emotional state. All coughs recorded in this study should be volitional, although a volitional cough may trigger a reflexive cough. Instructions were given to
record the cough samples at an arm’s length, following the advice provided to participants of the COUGHVID study18, to reduce the risk of the audio recording being distorted (clipped).
Participant instructions included guidance on coughing alone in a room or vehicle to reduce risk of COVID-19 transmission to others. Prompts and instructions are listed in Supplementary
Table S1. The first (out of four total coughs) per participant may involve more fluid clearance. Of the successive (final three) coughs, the first was likely to be the most powerful, and
successive coughs were likely to decrease in acoustic power as the participant had less time to inhale. Exhalation sounds were also collected, as in previous studies. Breathing sounds are
used in lung auscultation to identify narrowed airways or excessive fluid19 in the respiratory tract, although the clinical utility of external recordings (without a stethoscope) has not
been established. Participants were prompted to record three short, powerful exhalations (_“ha”_ sounds, as if the participants “were trying to fog up a window, or see their [sic] breath in
cold weather.”). Participants were recommended to make this recording in a quiet environment to reduce background noise. For this recording, there was no direction around distance from
participant to the recording device. A sentence of speech, read from text, was also collected. Vowel sounds (such as ‘aah’ or ‘ee’) are used in lung auscultation (egophony) and to examine
the vocal tract (such as contraction of the soft palate20). As speech is a combination of many varying vocal tract configurations over time, making it a more complex sample (anatomically)
than coughing or breathing, it is more prone to biases in cognition, literacy, accent, and other learnt speech patterns. However, speech samples may potentially be richer
in acoustic features, particularly since smart device microphones, audio data processing, and the majority of vocal audio feature extraction models are configured for speech. Speech is
produced through volitional manipulation of the vocal tract, where the shape of air cavities and air pressure is varied. A short sentence, _“I love nothing more than an afternoon cream
tea”_, was chosen, combining several vowel and nasal sounds in a single recording.

RECRUITMENT

Participants were recruited through two existing SARS-CoV-2 infection testing pathways in
parallel: (1) a community prevalence survey and (2) a government testing service. They were invited to take part in the study after they underwent testing. Survey responses and audio
recordings were then linked to their test result. Inclusion criteria across both recruitment channels were: being 18 years of age or older and having a COVID-19 test barcode number.
Participants were also advised to participate only if they had tested in the last 72 hours, although 13.2% of REACT-1-recruited and 2.1% of NHS Test and Trace-recruited participants in the
dataset have a submission delay exceeding 72 hours (see Fig. 1e; submission delay is described in the participant metadata file, see Supplementary Table S3). Participation was completely
voluntary. This study includes some participants with a permanent address in the UK devolved administrations (Northern Ireland, Scotland, Wales); however, the data are disproportionately sampled from England because the majority of recruitment routes recruited only in England. Participants were recruited via the Real-time Assessment of Community Transmission-1 (REACT-1)
study. REACT-1 was commissioned by the UK Department of Health and Social Care to estimate the prevalence of SARS-CoV-2 infection in the community in England (and influenza A and B in later
survey rounds). This was carried out by Imperial College London in partnership with Ipsos MORI using repeat, random, cross-sectional sampling of the population. Participants were randomly
selected from National Health Service England records (which include almost the entire population) and sent a letter of invitation, with the aim of creating a representative sample of the
population for each survey round (although actual response demographics vary, see Usage Notes). Participants were provided with instructions to take a throat and nasal self-swab and were
asked to respond to an online/telephone survey about their demographics, symptoms and recent behaviours. The swab was either posted or collected by courier for PCR testing at laboratories.
For rounds 13–18 (REACT survey dates from 2021-06-24 to 2022-03-01), participants were asked if they agreed to be contacted about further research led by the UKHSA. After sending their swab
to a laboratory and completing the REACT-1 survey, those who agreed to be contacted were sent an email invitation to the online survey for this study which included audio recordings (survey
questions and responses listed in Supplementary Table S1). 12.2% of the 295,493 individuals contacted for recruitment in REACT rounds 14–18 participated in the study and are included in the
final dataset. Supplementary Table S2 lists participant cohorts and recruitment methods in further detail. Participants were also recruited via SARS-CoV-2 testing services delivered by NHS
Test and Trace (T&T). The purpose of this recruitment channel was to increase the number of survey responses linked to a positive PCR test result to better balance the combined dataset
by SARS-CoV-2 infection status. Whereas the prevalence of SARS-CoV-2 infection in the REACT-1 cohort was expected to be similar to the prevalence in the general population, a higher proportion
of positive cases may be needed for the development of SARS-CoV-2 infection status classification models. During the study period, people were advised to seek a PCR-test through T&T
(swab testing for the wider population, as set out in government guidance, known as Pillar 2 (ref. 8)) if they were (i) experiencing COVID-19 symptoms, (ii) identified as a close contact of a
positive case, or (iii) taking a confirmatory PCR test following a positive rapid antigen (lateral flow) test (until 11th January 2022). Tests were free to use and available at test sites or
for home delivery. Throat and nasal swabs were mostly self-administered at test sites or in participants’ homes, before being sent to laboratory sites for testing8. A subset of participants
recruited through T&T reported lateral flow test results. Lateral flow testing of SARS-CoV-2 antigen was open and free to the public in the UK, including for asymptomatic testing, and
in the majority of cases was performed by the participant and reported through the NHS COVID-19 app or website. Those who underwent testing could opt in to be contacted about participating
in research. An eligible subset of these were then contacted by text, email or phone call to invite them to participate in the study. Supplementary Table S2 lists participant cohorts and
recruitment methods in further detail. Eligible populations were defined by SARS-CoV-2 infection and symptom status over a distribution of ages. Recruitment was initially focused on those
receiving a positive test result. Between 2021-11-11 and 2022-03-04, recruitment was targeted at a 50% random sample of all those testing positive, negative, or with a void PCR test result. Participants were linked to an online survey where prompts and audio recording format were uniform across recruitment channels. A small proportion of participants prior to
2021-03-17 were recruited via information leaflets at regional COVID-19 test sites, displaying a QR code linked to the study survey. Participants were also recruited from The ONS Coronavirus
Infection Survey (https://www.ons.gov.uk/surveys/informationforhouseholdsandindividuals/householdandindividualsurveys/covid19infectionsurveycis) and the COVID-19 Challenge study (COV-CHIM
01, https://www.hra.nhs.uk/planning-and-improving-research/application-summaries/research-summaries/cov-chim01-sars-cov-2-dose-finding-infection-study_v10/) however lower participant counts
from these recruitment methods could not guarantee participant anonymisation and so these participants are not included in the UK COVID-19 Vocal Audio Dataset.

DATA COLLECTION

The online
survey ‘Speak up and help beat coronavirus’ (https://www.gov.uk/government/news/speak-up-and-help-beat-coronavirus-covid-19) was accessible via compatible internet connected devices with the
ability to capture audio recordings, such as smartphones, tablets, laptops, and desktop computers. Participants recruited from T&T testing services were contacted to take part in the
study after completing a SARS-CoV-2 test and agreeing to take part in research (see Supplementary Table S2 for modes of contact). Those recruited from the REACT-1 cohort were sent an email
invitation. Participants reviewed the participant information and confirmed their informed consent to take part. An automated check to confirm a participant’s device was able to record audio
was integrated into the digital survey, which participants needed to complete before continuing. Participants accepted a participation agreement and privacy statement outlining how their
survey and test data would be linked, how their data would be used for research, and how it would be made available for reuse by researchers. Next, they entered their test/personal barcode number, followed
by responses to questions about their demographics, comorbidities and any symptoms they were currently experiencing. Participants responded to survey questions from a choice of predefined
responses (survey questions and multiple-choice responses listed in Supplementary Table S1, survey completion rates listed in Fig. 1c). Until 2021-08-12, the ‘alpha phase’ data gathering solution
was hosted at www.ciab2021.uk (used by 18.6% of participants, noted as ‘alpha’ in the ‘survey_phase’ metadata variable, see Supplementary Table S3), and from 2021-08-13 to 2022-03-07, the
‘beta phase’ data gathering solution was hosted at www.speakuptobeatcovid.uk (used by 81.4% of participants, noted as ‘beta’ in the ‘survey_phase’ metadata variable, see Supplementary Table
S3). To ensure robustness, both data gathering solutions were tested extensively to ensure data gathered was recorded accurately in the databases of the respective solution, and to confirm
that the data was subsequently transferred to UKHSA correctly. This included end-to-end tests with dummy submissions. The API and associated configuration used for recording audio in the
‘alpha’ solution was replicated as-is in the ‘beta’ solution. Recordings through both data gathering solutions were compared to check consistency, including comparison of Fast Fourier Transform (FFT) spectra, file format, and sampling rates using the python librosa library (https://github.com/librosa/librosa). The solution delivery teams for both ‘alpha’ and ‘beta’ data gathering solutions confirmed that no post-processing of the stored audio files occurred for either solution.
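As an illustration of this kind of consistency check, the sketch below loads one recording from each solution with librosa and compares sample rates and mean FFT magnitude spectra. The file paths are hypothetical, and the comparison is a minimal stand-in for the checks described above, not the study team's actual test suite.

```python
import librosa
import numpy as np

def mean_spectrum(path: str) -> tuple[int, np.ndarray]:
    """Load a recording at its native sample rate and return the
    sample rate and the mean FFT magnitude spectrum."""
    y, sr = librosa.load(path, sr=None, mono=True)
    magnitude = np.abs(librosa.stft(y, n_fft=2048))
    return sr, magnitude.mean(axis=1)

# Hypothetical paths: one recording from each data gathering solution.
sr_alpha, spec_alpha = mean_spectrum("alpha/cough.wav")
sr_beta, spec_beta = mean_spectrum("beta/cough.wav")

assert sr_alpha == sr_beta, "sample rates should match across solutions"
# A high correlation between mean spectra suggests consistent capture chains.
print("spectral correlation:", np.corrcoef(spec_alpha, spec_beta)[0, 1])
```

DATA LINKING

A data pipeline was designed to merge the primary data gathered in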
support of this study (the submission data) with the secondary data (the SARS-CoV-2 test results data) gathered by each testing provider. Data pipeline code was drafted and peer-reviewed by
the UKHSA study team, and was also reviewed independently by the data wrangling team from The Turing-RSS Health Data Lab. Survey data and audio recordings submitted by the participant were
linked to the SARS-CoV-2 test result data (date, result, test type, testing laboratory, PCR cycle threshold values (if provided), and estimated viral load (if provided)) for the test they
underwent prior to being recruited. They were also linked to demographic information of potential additional utility to the dataset (age, gender, ethnicity, geographical information,
COVID-19 vaccination status) which was collected by the testing provider. Test barcodes were used to link T&T data to survey data. Test results data from T&T-recruited participants
were sourced from the National Pathology Exchange (NPEx) database that stored test result data from across the T&T laboratory network and home-based lateral flow test results.
Participant age, gender, ethnicity, and location were derived from data entered by the participant when registering for a test and stored in the NPEx database. The study team generated a set
of unique personal codes, which were provided securely to Ipsos MORI, who included a code in each email invitation to participate in this study. These personal codes differed in format from
T&T barcodes to avoid accidental duplication. This personal code was used to link REACT-1 data to survey data. For REACT-1-recruited participants, the test result and associated data
were provided by Ipsos MORI as a filtered extract of the REACT-1 study data including only the records and fields relevant to this study. Participant codes were extracted from responses to
the survey for this study and transferred via approved protocols to Ipsos MORI. Ipsos MORI checked for duplicate entries before then extracting and sharing the test result data for UKHSA to
link back to survey submissions. The pipeline script was designed to exclude any participant submissions from the final dataset that could not be linked to valid test results data using the
test identifier code submitted by the participant. The test identifier codes were not publicly available and were provided to the participant through the relevant recruitment route,
mitigating the risk of any submissions where the primary data gathered was provided by a different individual from the secondary test results data. To further mitigate this risk, as well as
provide a metric required for the study exclusion criteria, the pipeline script calculates the delay between the time of the participant’s submission to the primary data gathering solution and either the swab time or lab processing time for the SARS-CoV-2 test the participant conducted ('submission_delay' variable, see Supplementary Table S3). This
variable enables results to be filtered out from the study data if there was a significant delay between submission and SARS-CoV-2 test, as this could either suggest the participant entered
the test identifier incorrectly and has been associated with the wrong test result record, or due to the delay the test may no longer be indicative of the participant’s SARS-CoV-2 infection
status. Participant submissions which could not be linked to a valid SARS-CoV-2 test result were excluded from the final dataset. Test barcode numbers were removed at the end of the study,
and replaced with a random identifier associated with an individual participant ('participant_identifier' metadata variable, see Supplementary Tables S3, S4), de-linking the
participant metadata from their test barcode number and identifiable information associated with it.

DATA CLEANING

Duplicate entries from the same participant were removed where possible so
that the data tables have one row (equivalent to one survey entry) per participant. For T&T-recruited participants, repeat individuals were identified in the source test results data
table, and repeat submissions were removed keeping only each individual’s first submission, which would be closer to the time of the participant’s SARS-CoV-2 test. For the REACT-1-recruited
participants, Ipsos MORI indicated which submissions related to individuals who had previously taken part in the study, and repeat submissions were removed keeping only first submissions.
There remains a residual risk that some individuals took part in both recruitment groups and as a result have made multiple submissions in the study data, however, this is expected to be a
low volume due to the national scale of both recruitment approaches. The removal of duplicates cannot be guaranteed prior to 2021-06-01 (1.4% of participants), as a shorter agreed personal
data retention period for this ‘pilot’ phase of data collection meant that test barcodes could not be stored for the duration of the study. Variable categories were standardised for
uniformity across recruitment channels where there was overlap between categories. REACT category names were typically renamed to match T&T category names. Data types were standardised
by variable (unless disclosure controls required mixed data types). List variables from survey multiple choice questions were one-hot encoded.
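A minimal sketch of these two cleaning steps in pandas is shown below; the file name and column names ('individual_id', 'submission_time', 'symptoms') are hypothetical placeholders, not the dataset's actual schema.

```python
import pandas as pd

df = pd.read_csv("survey_submissions.csv")  # hypothetical raw extract

# Keep only each individual's first submission, which is closest in time
# to their SARS-CoV-2 test.
df = (
    df.sort_values("submission_time")
      .drop_duplicates(subset="individual_id", keep="first")
)

# One-hot encode a multiple-choice list variable such as reported symptoms,
# e.g. "cough;sore throat" becomes separate binary columns.
symptom_dummies = df["symptoms"].str.get_dummies(sep=";")
df = df.join(symptom_dummies.add_prefix("symptom_"))
```

DATA ANONYMISATION

To enable wider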
accessibility, an open access version of the UK COVID-19 Vocal Audio Dataset9 was produced to protect participant anonymity according to ISB1523: Anonymisation Standard for Publishing Health and Social Care Data
(https://digital.nhs.uk/data-and-information/information-standards/information-standards-and-data-collections-including-extractions/publications-and-notifications/standards-and-collections/isb1523-anonymisation-standard-for-publishing-health-and-social-care-data).
The audio recordings of read sentences were removed in the open access version of the dataset, on the basis that non-distorted speech data can constitute sensitive biometric personal
information, carrying a risk of participant reidentification. Additionally, several participant metadata variables were either removed, binned, obfuscated, or pseudonymised to meet the
requirement of k-3 anonymity (k-anonymity with k = 3) after combining all variables relating to personal data. The total number of participants remained unchanged. Specifically, audio metadata relating to the audio
recordings of read sentences were removed, including audio transcripts. Participant metadata variables relating to ethnicity, first language, vaccination status, height, weight, and COVID-19
test laboratory code were removed. Participant age was binned into age groups, and survey recruitment source variables were binned into general recruitment source groups. All dates were indexed to a random date and obfuscated with ±10 days of random noise. All dates included in the metadata are indexed to the same random date for comparison, and all dates for each participant have the same level of noise applied, to allow for the calculation of time differences at the participant level.
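The sketch below illustrates one way such a scheme could be implemented in pandas; the column names are hypothetical and the dataset authors' actual implementation may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng()

def obfuscate_dates(df: pd.DataFrame, date_cols: list[str]) -> pd.DataFrame:
    out = df.copy()
    # One shared random index date for the whole dataset.
    index_date = pd.Timestamp("2020-01-01") + pd.Timedelta(days=int(rng.integers(0, 1000)))
    # One noise draw per participant, reused for all of that participant's
    # dates, so within-participant intervals are preserved.
    noise_days = rng.integers(-10, 11, size=len(out))
    for col in date_cols:
        offset = (pd.to_datetime(out[col]) - index_date).dt.days
        out[col] = offset + noise_days
    return out

# e.g. obfuscate_dates(meta, ["submission_date", "test_date"])  # hypothetical columns
```

Geographical information was originally collected at the local authority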
(sub-regional administrative division) level, and was later aggregated to region (first level of national sub-division) level and pseudonymised to avoid the risk of participant disclosure. A
flagged COVID-19 test laboratory code in the protected dataset indicated a laboratory with reported false COVID-19 test results. All results from this laboratory have been set to None for
the open dataset version, slightly altering overall counts of positive, negative and invalid results. All summary statistics presented in this article reflect the open access version of the
UK COVID-19 Vocal Audio Dataset9 unless otherwise stated. A data dictionary for the open access version of the UK COVID-19 Vocal Audio Dataset metadata is provided in Supplementary Tables
S3, S4.

ETHICS

This study has been approved by The National Statistician’s Data Ethics Advisory Committee (reference NSDEC(21)01), the Cambridge South NHS Research Ethics Committee (reference 21/EE/0036), and the Nottingham NHS Research Ethics Committee (reference 21/EM/0067).

DATA RECORDS

The open access version of the UK COVID-19 Vocal Audio Dataset has been deposited in
a Zenodo repository (https://doi.org/10.5281/zenodo.10043977)9, and is available under an Open Government License (v3.0,
https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/). Additional data records as part of the protected dataset version may be requested from UKHSA
([email protected]), and will be granted subject to approval and a data sharing contract. To learn how to apply for UKHSA data, visit:
https://www.gov.uk/government/publications/accessing-ukhsa-protected-data/accessing-ukhsa-protected-data. There were 72,999 participants included in the final dataset, with one submission
per participant. This included 25,706 participants linked to a positive SARS-CoV-2 test. The majority of these submissions (70,565, 96.7%) were linked to results derived from PCR tests
(RT-PCR, q-PCR, ePCR), followed by lateral flow tests (1,925, 2.6%) and LAMP (loop‐mediated isothermal amplification) tests (244, 0.3%). Of all test results, 257 (0.4%) were inconclusive
with an unknown or void result. This dataset represents the largest collection of PCR-referenced audio recordings for SARS-CoV-2 infection to date, with approximately 2.6 times more participants with PCR-referenced audio recordings than the IATos COVID-19 dataset (with 27,101)21. The majority of participants (44,565 participants, equalling 61.0% of the total dataset) were
recruited via REACT-1. The remaining 28,434 participants (equalling 39.0% of the total dataset) were recruited via T&T testing services. Figure 2 shows a summary of participant
attributes. The median age of participants was 53 years old and 59.6% of participants were female (43,537 participants). Participants with a positive SARS-CoV-2 test result were more likely
to report that they were experiencing respiratory symptoms (90.7% of participants testing positive vs. 20.9% of participants testing negative). This dataset has additional potential uses for bioacoustics research, as 9,749 (13.4%) participants reported a pre-existing respiratory health condition (comorbidity)
of which 8,249 (11.3%) reported asthma. Participants recruited from REACT rounds 16–18 (19,859 participants, 27.2% total) have linked influenza PCR test results, with 33 participants testing
positive for influenza A and 28 testing positive for influenza B. Several recruitment biases were apparent and are discussed in Usage Notes. Audio was recorded in _.wav_ format (86.2% of submissions had a sample rate of 48 kHz for all recordings, and 13.2% had a sample rate of 44.1 kHz for all recordings) and had a maximum length of 64 seconds (see Fig. 3a). Three
audio files (one for each recording) are provided for each of the 72,999 participants (unless missing, see Technical Validation). The protected version of the dataset contains one extra
audio file (of recorded speech) per participant with a maximum length of 72 seconds. Metadata, including audio filenames, are provided in .csv files, linked by a participant identifier code.
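As a usage sketch, the metadata tables can be read with pandas and joined on the participant identifier; the file and column names below are illustrative assumptions, and users should consult the data dictionaries for the exact schema.

```python
import pandas as pd

# Illustrative file names; see the Zenodo record for the exact layout.
participants = pd.read_csv("participant_metadata.csv")
audio = pd.read_csv("audio_metadata.csv")

# Tables are linked by the random participant identifier described above.
df = audio.merge(participants, on="participant_identifier", how="left")
print(df.head())
```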
Metadata data dictionaries are provided as tables for participant metadata (Supplementary Table S3) and audio metadata (Supplementary Table S4). Additional data available include the
participant training/testing splits for the investigations reported by _Coppock et al_.10 (in both open access and protected versions of the dataset) and OpenSmile features6 generated from
the audio files (in the protected dataset only).

TECHNICAL VALIDATION

Audio _.wav_ files were parsed and metadata extracted, including sample rate, number of samples, and number of channels, using the python scipy library (https://github.com/scipy/scipy). Using the audio data and extracted metadata, the audio length in seconds, audio amplitude (absolute maximum - absolute minimum signal), and audio signal-to-noise ratio were calculated for each file. Figure 3a–c show the distributions of audio file length, audio amplitude, and audio absolute signal mean to standard deviation ratio for each audio recording type, respectively. All files had one audio channel. Audio metadata is available in the audio_metadata _.csv_ file of the dataset.
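A minimal sketch of these checks, assuming a mono .wav file and the metric definitions given above (the exact implementation used by the study team is not reproduced here):

```python
import numpy as np
from scipy.io import wavfile

def audio_metrics(path: str) -> dict:
    sample_rate, data = wavfile.read(path)
    data = data.astype(np.float64)
    return {
        "sample_rate": sample_rate,
        "n_samples": data.shape[0],
        "n_channels": 1 if data.ndim == 1 else data.shape[1],
        "length_s": data.shape[0] / sample_rate,
        # One reading of "absolute maximum - absolute minimum signal".
        "amplitude": float(data.max() - data.min()),
        # Absolute signal mean to standard deviation ratio (as in Fig. 3c).
        "mean_std_ratio": float(np.abs(data).mean() / data.std()),
    }

print(audio_metrics("audio/participant_cough.wav"))  # hypothetical path
```

2.5% of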
participants were missing one or more audio files, or had audio files with a size of <45 bytes, and were flagged with the missing_audio variable. Empty audio files are not included in the
open access dataset, and so a small number of flagged audio file paths in the audio_metadata table will list a non-existent file. Audio metadata variable completeness by recording type is
included in Supplementary Table S4. Audio files were screened systematically to reduce the risk of disclosure of personal information. This could arise where participants had failed to
follow the study instructions and the audio prompts, instead accidentally or intentionally disclosing personal information such as their name. An analytical pipeline was developed to
identify outliers from the total 289,696 audio files (217,162 of which are in the open access dataset), where the outliers were screened manually. A speech-to-text model (fairseq S2T, small
version, pre-trained weights available at https://huggingface.co/facebook/s2t-small-librispeech-asr)22 was run on audio files, producing a text transcript. A text-to-embedding model (MPNet
Transformer v2, pre-trained weights available at https://huggingface.co/sentence-transformers/all-mpnet-base-v2)23 was run on the transcript, producing a format of the transcript (an
embedding) that was quantified and compared with the prompt to identify outliers, i.e. speech that differs from the prompt. Each embedding was then ranked by its similarity to the prompt using a Support Vector Machine (SVM) model. The 1,000 sentences which differed most were then inspected manually to check for disclosure of personal information, and non-outlier files were randomly sampled.
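The sketch below illustrates this screening approach with the two named pre-trained models, using plain cosine similarity in place of the SVM-based ranking; file paths are hypothetical and the study team's exact pipeline is not reproduced here.

```python
import torchaudio
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor
from sentence_transformers import SentenceTransformer, util

PROMPT = "I love nothing more than an afternoon cream tea"

processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")
s2t = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def transcribe(path: str) -> str:
    waveform, sr = torchaudio.load(path)
    if sr != 16_000:  # the S2T model expects 16 kHz input
        waveform = torchaudio.functional.resample(waveform, sr, 16_000)
    inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
    ids = s2t.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])
    return processor.batch_decode(ids, skip_special_tokens=True)[0]

# Rank sentence recordings by similarity of their transcript to the prompt;
# the least similar files become candidates for manual review.
paths = ["audio/abc_sentence.wav"]  # hypothetical paths
scores = util.cos_sim(
    embedder.encode([transcribe(p) for p in paths]),
    embedder.encode([PROMPT]),
)
```

Figure 3d shows the distribution of the similarity rank for every audio file for the sentence modality. The majority of transcripts (56.1%) show the correct sentence (_“I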
love nothing more than an afternoon cream tea”_). Most others sampled show a similar sentence, misinterpreted by the speech-to-text model, with 95.6% of transcripts containing the substring
_“nothing more”_, and 91.8% containing _“nothing more”_ and _“afternoon”_, with the other words commonly mis-transcribed. A small proportion of sampled outlier transcripts show alternative
speech from the participant. These transcripts and associated audio files were retained unless personal information was disclosed (one audio file was found to contain personal information
and was truncated). Other outliers had media playing in the background, while others showed noise artefacts, e.g. “_of of of of of of of of of…_”. Sentence transcripts and outlier scores are
available in the audio_metadata table of the protected version of the dataset. Several data filtering steps are recommended when using this dataset to develop models for SARS-CoV-2 infection classification: include only participants with a PCR-type SARS-CoV-2 test, exclude participants whose test was carried out in a laboratory with reported testing errors, and include only participants who completed the study survey within a defined delay (e.g. 10 days) of their SARS-CoV-2 test. _Pigoli et al_. describe these suggested data filtering steps in further detail24.
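A sketch of these filters in pandas is given below; the column names and category labels used here are assumptions and should be checked against the data dictionary (Supplementary Table S3).

```python
import pandas as pd

df = pd.read_csv("participant_metadata.csv")  # hypothetical file name

pcr_types = {"RT-PCR", "q-PCR", "ePCR"}
filtered = df[
    df["covid_test_type"].isin(pcr_types)   # PCR-referenced labels only
    & ~df["flagged_laboratory"]             # drop the flagged laboratory (boolean column assumed)
    & (df["submission_delay"].abs() <= 10)  # within 10 days of the test
]
print(f"kept {len(filtered)} of {len(df)} participants")
```

USAGE NOTES

THE USE OF THIS DATASET FOR SARS-COV-2 INFECTION STATUS CLASSIFICATION

For effective use of this dataset, users should be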
aware of limitations in the use of surrogate indicators such as vocal biomarkers in the development of SARS-CoV-2 infection status classification models. Many SARS-CoV-2 infections are
asymptomatic, and the presentation of any symptoms may be dependent on the stage of infection, which may not necessarily correlate with viral load and transmissibility25. 9.3% of
participants with a positive SARS-CoV-2 test result do not report any respiratory symptoms, and 3.8% report not having symptoms of any type. 20.9% of participants with a negative test result
reported respiratory symptoms. The specificity of audio-based SARS-CoV-2 detection may also be dependent on the prevalence of other circulating respiratory viruses, which may have similar
respiratory symptoms and effect on vocal audio. All participants with a positive test result for influenza also had a negative test result for SARS-CoV-2 infection. Other respiratory
infections that may have a similar symptom profile were not tested in this study. Other recorded variables, such as respiratory conditions and smoker status, may be confounding variables in
the analysis of vocal biomarkers. 14.6% of participants with a positive SARS-CoV-2 test result also reported a respiratory health condition (including asthma, COPD, and emphysema). _Coppock
et al_. further analyse the potential confounding effect of several variables in this dataset on SARS-CoV-2 infection classification10. There is some selection bias in the recruitment for
this study, where the majority of participants recruited via the REACT-1 surveillance study were SARS-CoV-2 negative at the time of participation, and the majority of the participants
recruited via T&T were SARS-CoV-2 positive at the time of participation. This selection was necessary to produce a dataset with a relative balance of SARS-CoV-2 infection status, but
researchers should note the varying composition of each recruitment population, which could be confounded with infection status. Additionally, within T&T recruitment, the recruitment
method of some positive and negative cases varied (see Supplementary Table S2). _Pigoli et al_. document the potential confounding variables in this dataset due to recruitment24. Of
particular note is symptom presentation, where participants recruited via T&T would have sought a PCR test due to having a positive lateral flow test (between 2021-03-29 (ref. 26) and 2022-01-11 (ref. 27)) or having symptoms (at least one of: a high temperature; a new continuous cough; a change to sense of smell or taste) as per UKHSA guidance at the time of data collection8.
Changes in UK testing policy, such as local surge testing or school and workplace testing policies would also create selection biases for the T&T population that varied over time.
Compared to T&T, COVID-19 positive participants recruited via the REACT-1 study were less likely to be symptomatic. The distribution of symptom status was more likely to reflect that in
the general population and be stable over time (in relation to SARS-CoV-2 prevalence). Users should note that not all those who were contacted to participate in the REACT-1 survey
participated, creating some self-selection biases. Only those who participated in the REACT-1 survey were contacted to participate voluntarily in this study, leading to further
self-selection biases. _Coppock et al_. list seven core issues with existing COVID-19 audio research and the datasets used28. While we have designed this dataset to address these issues (including providing PCR-confirmed infection status, providing demographic and health metadata for each participant, ensuring only one submission per participant, and publishing this dataset), several issues remain. Participants testing positive for SARS-CoV-2 infection at the time of participation may have been aware of their infection status, particularly those recruited via
T&T (Fig. 1e). This may introduce undocumented confounders in the audio recordings, such as behaviour when recording, the environment in which recordings are made, and participant
emotions. Recruitment biases, discussed above, may also present undocumented confounders present in the audio data. Although there is little variation in audio sample rate across the dataset
(see Data Records), we did not record device type, microphone hardware specifications, browser, or device operating system, which may have some effect on audio quality. Although PCR is the
gold-standard for detection of SARS-CoV-2 infection, users should note that it may be an imperfect label in categorising participants as infectious. Due to the amplification step, viral RNA
can remain detectable by PCR long after live SARS-CoV-2 can be cultured from patient samples. Stratifying model evaluation by estimated viral load may remedy this, where studies have shown
that the viral load threshold for transmission is ~1,000,000 viral RNA copies/ml25. We include viral load data where available (14.6% of submissions). Users are encouraged to use the
covid_viral_load_category variable rather than the covid_viral_load, covid_ct_value or covid_ct_mean variables, as there is variation between tests including different gene targets
(documented by covid_ct_gene). False negative PCR results are also possible, likely related to sampling technique, volume of fluid, and viral load. A meta-analysis found a pooled estimate of
94% PCR sensitivity29. PCR results from a laboratory reported to have made substantial testing errors have been made void in this dataset. The period of data collection saw different
SARS-CoV-2 variants (notably Alpha, Delta and Omicron) circulating in the UK, which have been reported to cause differing prevalence of symptoms to each other and to previous variants16.
Dataset authors recommend against using the covid_ct_gene metadata variable to estimate SARS-CoV-2 variant causing infection (e.g. through S-gene dropout30), as this variable reports only a
single gene target with the lowest cycle threshold value, and not all laboratories test for all genes. Due to recruitment constraints, we were unable to include longitudinal data (multiple
data entries by the same participant over time) for any participants, as is present in other datasets such as the COVID-19 Sounds31 dataset. As a result, this dataset is insufficient to
study potential vocal changes throughout SARS-CoV-2 infection in the same individual. Temporal changes may be studied in a cross-sectional manner with appropriate controls using the
symptom_onset variable. Alternative COVID-19 and influenza-related uses of the dataset may include the development of generic respiratory symptom detection or cough and cough frequency
detection, which may have utility in syndromic surveillance (if used in a privacy-preserving manner) or the monitoring of chronic disease in addition to acute disease. Any developed solution
should first be trialled in the context of its application to provide evidence of patient safety, generalisability, and reported effectiveness.

THE USE OF THIS DATASET FOR ASTHMA STATUS CLASSIFICATION

Of the 72,816 participants responding to the survey prompt regarding existing respiratory conditions, 8,249 (11.3%) report having asthma. This provides a large vocal audio
dataset labelled with participant asthma status, which may be used for training or evaluating machine learning models for asthma status classification. While asthma is a condition
characterised by respiratory symptoms, vocal audio should be considered a surrogate indicator compared to established diagnostic methods such as those measuring expiratory flow or
inflammation32. Unlike SARS-CoV-2 and influenza infection status, which is confirmed with a diagnostic test in this dataset, asthma status is self-reported. This can introduce confirmation
bias, where some undiagnosed asthma participants may be labelled as not having asthma. SARS-CoV-2 infection status and other respiratory conditions may be confounding factors in the use of
vocal audio for asthma status classification. 39.5% of participants reporting asthma also have a linked positive SARS-CoV-2 test result, 0.1% have a linked positive influenza A or B test
result, and 4.8% report another respiratory condition, including COPD and emphysema.

DATASET DEMOGRAPHIC BIASES

Demographic imbalances are present within the dataset, where study participants
were more likely to be White British, women, and aged 35–74 years than the general UK population. Figure 4 compares the distribution of ages, genders, ethnicities, and region of habitation
of study participants in comparison to the general population (as recorded by the 2021 UK Census), patients using T&T in the weeks of data collection33 (compared to only
T&T-recruited study participants), and REACT-1 study participants for the relevant study rounds34,35,36,37 (compared to only REACT-1-recruited study participants). Granular age data,
ethnicity group and UK region data are available only in the protected version of the dataset, and are presented here for context. Some dataset biases can be seen to be partly inherited from
the two recruitment channels, as in the case of gender, where more patients or participants were women in comparison to the general population. Other biases, such as age biases, can be seen
to be exacerbated by the recruitment of this study, where fewer participants over the age of 80 years were seen in the two recruitment channels (and none under the age of 18 years, due to
study exclusion criteria). The survey for this study was only made available in English which could have exacerbated language and ethnicity biases. Similar demographic biases were present in
other voluntary digital surveys for COVID-19 research and surveillance in the UK, such as the COVID Symptom Study38. The nature of study recruitment and participation may exclude certain
demographic groups with limited digital literacy or access to digital infrastructure39. The voluntary nature of this study may exclude certain demographic groups with limited available time
due to employment and/or care commitments40. We recommend that researchers using this dataset to train audio classification models report test accuracy statistics stratified by demographic variables to communicate any model biases.
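One way to report such stratified results, sketched with hypothetical column names for a table of held-out predictions:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# One row per test-set participant: model score, PCR label, and demographics.
preds = pd.read_csv("test_predictions.csv")  # hypothetical file

# Compute discrimination separately for each demographic group so that
# performance gaps are reported alongside the headline metric.
stratified = preds.groupby("gender").apply(
    lambda g: roc_auc_score(g["covid_positive"], g["model_score"])
)
print(stratified)
```

The substantial majority of participants (94.5%) report English as their first language or the language most commonly spoken at home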
(if they have two or more first languages). Therefore, any analysis of speech data may only be valid in English speakers and should be tested in other populations before
language-generalisable results are reported. Regional accents may have an effect on speech models. Recruitment is relatively balanced by administrative region, particularly for
REACT-1-recruited participants. As a result, the audio data may contain a representative sample of regional English accents. Most study participants were recruited in England, and so more
speech data would be needed to evaluate accents which are more common outside of England. The participant metadata variables not captured directly in the digital survey (digital survey
questions and related variables listed in Supplementary Table S1) were shared by the relevant recruitment channel (see Methods - Survey Design), where format and prompt vary. Efforts have
been made to standardise data format between recruitment channels and are listed in the participant metadata dictionary (Supplementary Table S3). Users should note that some calculated
variables, such as symptom_onset, continue to have values of distinct distributions despite this standardisation due to the variation in recruitment methods (patients seeking a test vs
survey population). T&T- and REACT-derived demographic variables had limited multiple-choice options, and only limited ethnicity and gender categories were available, meaning some demographic
analyses are not possible.

CODE AVAILABILITY

Summary statistics and relevant figures can be reproduced from the open access or protected versions of the UK COVID-19 Vocal Audio Dataset
using code found here: https://github.com/alan-turing-institute/Turing-RSS-Health-Data-Lab-Biomedical-Acoustic-Markers/ which is archived41 under https://doi.org/10.5281/zenodo.11208315.
REFERENCES

1. Anthes, E. Alexa, do I have COVID-19? _Nature_ 586, 22–25 (2020).
2. Bossuyt, P. M. _et al_. STARD 2015: An Updated List of Essential Items for Reporting Diagnostic Accuracy Studies. _Clin. Chem._ 61, 1446–1452 (2015).
3. Sounderajah, V. _et al_. Developing a reporting guideline for artificial intelligence-centred diagnostic test accuracy studies: the STARD-AI protocol. _BMJ Open_ 11, e047709 (2021).
4. Laguarta, J., Hueto, F. & Subirana, B. COVID-19 Artificial Intelligence Diagnosis Using Only Cough Recordings. _IEEE Open J. Eng. Med. Biol._ 1, 275–281 (2020).
5. Brown, C. _et al_. Exploring Automatic Diagnosis of COVID-19 from Crowdsourced Respiratory Sound Data. In _Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining_ 3474–3484, https://doi.org/10.1145/3394486.3412865 (2020).
6. Eyben, F., Wöllmer, M. & Schuller, B. Opensmile: the munich versatile and fast open-source audio feature extractor. In _Proceedings of the 18th ACM International Conference on Multimedia_ 1459–1462, https://doi.org/10.1145/1873951.1874246 (Association for Computing Machinery, New York, NY, USA, 2010).
7. UK Health Security Agency. _SARS-CoV-2 Variants of Concern and Variants under Investigation in England, Technical Briefing 39_. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1063424/Tech-Briefing-39-25March2022_FINAL.pdf (2022).
8. Department of Health & Social Care. _COVID-19 Testing Data: Methodology Note_. https://www.gov.uk/government/publications/coronavirus-covid-19-testing-data-methodology/covid-19-testing-data-methodology-note (2020).
9. Coppock, H. _et al_. The UK COVID-19 Vocal Audio Dataset. _Zenodo_ https://doi.org/10.5281/zenodo.10043977 (2023).
10. Coppock, H. _et al_. Audio-based AI classifiers show no evidence of improved COVID-19 screening over simple symptoms checkers. _Nat. Mach. Intell._ 1–14, https://doi.org/10.1038/s42256-023-00773-8 (2024).
11. Budd, J. _et al_. A large-scale and PCR-referenced vocal audio dataset for COVID-19. Preprint at https://doi.org/10.48550/arXiv.2212.07738 (2023).
12. University of Oxford & Office for National Statistics. _Protocol and Information Sheets, COVID-19 Infection Survey_. https://www.ndm.ox.ac.uk/covid-19/covid-19-infection-survey/protocol-and-information-sheets (2022).
13. Pijls, B. G. _et al_. Demographic risk factors for COVID-19 infection, severity, ICU admission and death: a meta-analysis of 59 studies. _BMJ Open_ 11, e044640 (2021).
14. van Zyl-Smit, R. N., Richards, G. & Leone, F. T. Tobacco smoking and COVID-19 infection. _Lancet Respir. Med._ 8, 664–665 (2020).
15. Office for National Statistics. _2011 Census: Detailed Analysis - English Language Proficiency in England and Wales, Main Language and General Health Characteristics_. https://www.ons.gov.uk/peoplepopulationandcommunity/culturalidentity/language/articles/detailedanalysisenglishlanguageproficiencyinenglandandwales/2013-08-30 (2013).
16. Menni, C. _et al_. Symptom prevalence, duration, and risk of hospital admission in individuals infected with SARS-CoV-2 during periods of omicron and delta variant dominance: a prospective observational study from the ZOE COVID Study. _The Lancet_ 399, 1618–1624 (2022).
17. Mills, C., Jones, R. & Huckabee, M.-L. Measuring voluntary and reflexive cough strength in healthy individuals. _Respir. Med._ 132, 95–101 (2017).
18. Orlandic, L., Teijeiro, T. & Atienza, D. The COUGHVID crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms. _Sci. Data_ 8, 156 (2021).
19. Bohadana, A., Izbicki, G. & Kraman, S. S. Fundamentals of Lung Auscultation. _N. Engl. J. Med._ 370, 744–751 (2014).
20. Boyce, J. O., Kilpatrick, N., Teixeira, R. P. & Morgan, A. T. Say ‘ahh’… assessing structural and functional palatal issues in children. _Arch. Dis. Child. - Educ. Pract._ 105, 172–173 (2020).
21. Pizzo, D. T. & Esteban, S. IATos: AI-powered pre-screening tool for COVID-19 from cough audio samples. Preprint at https://doi.org/10.48550/arXiv.2104.13247 (2021).
22. Wang, C. _et al_. fairseq S2T: Fast Speech-to-Text Modeling with fairseq. Preprint at https://doi.org/10.48550/arXiv.2010.05171 (2022).
23. Song, K., Tan, X., Qin, T., Lu, J. & Liu, T.-Y. MPNet: Masked and Permuted Pre-training for Language Understanding. _Advances in Neural Information Processing Systems_ 33, 16857–16867, https://proceedings.neurips.cc/paper_files/paper/2020/file/c3a690be93aa602ee2dc0ccab5b7b67e-Paper.pdf (2020).
24. Pigoli, D. _et al_. Statistical Design and Analysis for Robust Machine Learning: A Case Study from COVID-19. Preprint at https://doi.org/10.48550/arXiv.2212.08571 (2023).
25. Cevik, M. _et al_. SARS-CoV-2, SARS-CoV, and MERS-CoV viral load dynamics, duration of viral shedding, and infectiousness: a systematic review and meta-analysis. _Lancet Microbe_ 2, e13–e22 (2021).
26. Department of Health & Social Care. _Government Reintroduces Confirmatory PCR Testing for Assisted Testing_. https://www.gov.uk/government/news/government-reintroduces-confirmatory-pcr-testing (2021).
27. UK Health Security Agency. _Confirmatory PCR Tests to Be Temporarily Suspended for Positive Lateral Flow Test Results_. https://www.gov.uk/government/news/confirmatory-pcr-tests-to-be-temporarily-suspended-for-positive-lateral-flow-test-results (2022).
28. Coppock, H., Jones, L., Kiskin, I. & Schuller, B. COVID-19 detection from audio: seven grains of salt. _Lancet Digit. Health_ 3, e537–e538 (2021).
29. Arevalo-Rodriguez, I. _et al_. False-negative results of initial RT-PCR assays for COVID-19: A systematic review. _PLOS ONE_ 15, e0242958 (2020).
30. World Health Organization. _Classification of Omicron (B.1.1.529): SARS-CoV-2 Variant of Concern_. https://www.who.int/news/item/26-11-2021-classification-of-omicron-(b.1.1.529)-sars-cov-2-variant-of-concern (2021).
31. Xia, T. _et al_. COVID-19 Sounds: A Large-Scale Audio Dataset for Digital Respiratory Screening. In _Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)_ https://openreview.net/forum?id=9KArJb4r5ZQ (2021).
32. Hargreave, F. E. & Nair, P. The definition and diagnosis of Asthma. _Clin. Exp. Allergy_ 39, 1652–1658 (2009).
33. _Weekly Statistics for NHS Test and Trace (England): 2 to 15 June 2022_. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1085136/NHS-test-and-trace-23-june-2022.pdf (2022).
34. Elliott, P. _et al_. Exponential growth, high prevalence of SARS-CoV-2, and vaccine effectiveness associated with the Delta variant. _Science_ 374 (2021).
35. Chadeau-Hyam, M. _et al_. SARS-CoV-2 infection and vaccine effectiveness in England (REACT-1): a series of cross-sectional random community surveys. _Lancet Respir. Med._ 10, 355–366 (2022).
36. Elliott, P. _et al_. Rapid increase in Omicron infections in England during December 2021: REACT-1 study. _Science_ 375, 1406–1411 (2022).
37. Chadeau-Hyam, M. _et al_. Omicron SARS-CoV-2 epidemic in England during February 2022: A series of cross-sectional community surveys. _Lancet Reg. Health – Eur._ 21 (2022).
38. Davies, N. M. _et al_. Implications of selection bias for the COVID Symptom Tracker Study. _Science_ (2020).
39. Office for National Statistics. _Exploring the UK’s Digital Divide_. https://www.ons.gov.uk/peoplepopulationandcommunity/householdcharacteristics/homeinternetandsocialmediausage/articles/exploringtheuksdigitaldivide/2019-03-04 (2019).
40. Sullivan, O. & Gershuny, J. United Kingdom Time Use Survey, 2014-2015. _UK Data Service_ https://doi.org/10.5255/UKDA-SN-8128-1 (2021).
41. Turing-RSS Health Data Lab & The Alan Turing Institute. alan-turing-institute/Turing-RSS-Health-Data-Lab-Biomedical-Acoustic-Markers archive. _Zenodo_ https://doi.org/10.5281/zenodo.11208315 (2024).
42. Zarkogianni, K. _et al_. The smarty4covid dataset and knowledge base as a framework for interpretable physiological audio data analysis. _Sci. Data_ 10, 770 (2023).
43. Ponomarchuk, A. _et al_. Project Achoo: A Practical Model and Application for COVID-19 Detection from Recordings of Breath, Voice, and Cough. _IEEE J. Sel. Top. Signal Process._ 16, 175–187 (2022).
44. Bhattacharya, D. _et al_. Coswara: A respiratory sounds and symptoms dataset for remote screening of SARS-CoV-2 infection. _Sci. Data_ 10, 397 (2023).
45. Chaudhari, G. _et al_. Virufy: Global Applicability of Crowdsourced and Clinical Datasets for AI Detection of COVID-19 from Cough. Preprint at https://doi.org/10.48550/arXiv.2011.13320 (2021).
46. Office for National Statistics. _Population and Household Estimates, England and Wales: Census 2021_. https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/populationandhouseholdestimatesenglandandwalescensus2021 (2022).
ACKNOWLEDGEMENTS The authors gratefully acknowledge the contributions of staff from the NHS Test and Trace Lighthouse Labs, the REACT Study, Ipsos MORI, Studio24 and Fujitsu Services Ltd. Authors at The Alan Turing Institute and Royal Statistical Society Health Data Lab gratefully acknowledge funding from the Data, Analytics and Surveillance Group, a part of the UKHSA. This work was funded by the Department of Health and Social Care (grant ref: 2020/045) with support from The Alan Turing Institute (EP/W037211/1) and in-kind support from the Royal Statistical Society. J.B. and R.A.M. acknowledge funding from the i-sense EPSRC IRC in Agile Early Warning Sensing Systems for Infectious Diseases and Antimicrobial Resistance (EP/R00529X/1). A.T.C. acknowledges funding from the European Union’s Horizon 2020 research and innovation programme under Marie Skłodowska–Curie grant agreement No. 801604. S.P. acknowledges funding from the Economic and Social Research Council (ESRC) (grant number ES/P000592/1). G.N. acknowledges funding from the NIHR Biomedical Research Centre, Oxford (grant no. NIHR203311). AUTHOR INFORMATION
AUTHORS AND AFFILIATIONS
* London Centre for Nanotechnology, University College London, London, UK: Jobie Budd & Rachel A. McKendry
* Division of Medicine, University College London, London, UK: Jobie Budd & Rachel A. McKendry
* King’s College London, London, UK: Kieran Baker, Vasiliki Koutra, Steven Gilmour & Davide Pigoli
* The Alan Turing Institute, London, UK: Kieran Baker, Emma Karoune, Harry Coppock, George Nicholson, Vasiliki Koutra, Radka Jersakova, Sylvia Richardson, Björn W. Schuller, Steven Gilmour, Davide Pigoli, Stephen Roberts & Chris Holmes
* Imperial College London, London, UK: Harry Coppock & Björn W. Schuller
* UK Health Security Agency, London, UK: Selina Patel, Richard Payne, Ana Tendero Cañadas, Alexander Titcomb, David Hurley, Sabrina Egglestone, Lorraine Butler, Jonathon Mellor, Josef Packham & Tracey Thornley
* Institute of Health Informatics, University College London, London, UK: Selina Patel
* Centre for Stress and Age-Related Disease, School of Applied Sciences, University of Brighton, Brighton, UK: Ana Tendero Cañadas
* University of Oxford, Oxford, UK: George Nicholson, Stephen Roberts & Chris Holmes
* University of Surrey, Guildford, UK: Ivan Kiskin
* The Surrey Institute for People-Centred AI, Centre for Vision, Speech and Signal Processing, Guildford, UK: Ivan Kiskin
* University of Lancaster, Lancaster, UK: Peter Diggle
* CHI, MRI, Technical University of Munich, Munich, Germany: Björn W. Schuller
* University of Nottingham, Nottingham, UK: Tracey Thornley
AUTHORS Jobie Budd, Kieran Baker, Emma Karoune, Harry Coppock, Selina Patel, Richard Payne, Ana Tendero Cañadas, Alexander Titcomb, David Hurley, Sabrina Egglestone, Lorraine Butler, Jonathon Mellor, George Nicholson, Ivan Kiskin, Vasiliki Koutra, Radka Jersakova, Rachel A. McKendry, Peter Diggle, Sylvia Richardson, Björn W. Schuller, Steven Gilmour, Davide Pigoli, Stephen Roberts, Josef Packham, Tracey Thornley & Chris Holmes
CONTRIBUTIONS J.B.: Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Project administration,
Software, Visualization, Writing – original draft, Writing – review & editing. K.B.: Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Software, Validation,
Visualization, Writing – original draft, Writing – review & editing. E.K.: Project administration, Writing – original draft, Writing – review & editing. H.C.: Conceptualization, Data
curation, Formal Analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – review & editing. S.P.: Funding acquisition, Investigation, Methodology, Project
administration, Resources, Supervision, Writing – review & editing. R.P.: Data curation, Investigation, Software, Supervision, Validation, Writing – review & editing. A.T.C.:
Investigation, Project administration, Resources, Supervision, Writing – review & editing. A.T.: Project administration, Resources, Software, Supervision, Validation, Visualization,
Writing – original draft, Writing – review & editing. D.H.: Data curation, Investigation, Software, Supervision, Validation, Writing – original draft. S.E.: Funding acquisition, Project
administration, Resources, Writing – review & editing. L.B.: Investigation, Project administration, Resources, Writing – review & editing. J.M.: Supervision, Writing – review &
editing. G.N.: Data curation, Formal Analysis, Investigation, Methodology, Validation, Writing – review & editing. I.K.: Conceptualization, Data curation, Formal Analysis, Investigation,
Methodology, Software, Supervision, Writing – review & editing. V.K.: Conceptualization, Investigation, Methodology, Writing – review & editing. R.J.: Validation, Writing – review
& editing. R.A.M.: Supervision, Writing – review & editing. P.D., S.R.: Supervision. B.S.: Conceptualization, Formal Analysis, Investigation, Methodology, Supervision, Writing –
review & editing. S.G.: Formal Analysis, Funding acquisition, Methodology, Project administration, Supervision. D.P.: Formal Analysis, Supervision, Validation, Writing – review &
editing. S.R.: Conceptualization, Formal Analysis, Investigation, Methodology, Project administration, Supervision, Writing – review & editing. J.P.: Conceptualization, Funding
acquisition, Investigation, Project administration, Resources, Supervision, Writing – review & editing. T.T.: Conceptualization, Investigation, Resources, Supervision, Writing – review
& editing. C.H.: Conceptualization, Investigation, Methodology, Project administration, Resources, Supervision, Writing – review & editing. CORRESPONDING AUTHOR Correspondence to
Emma Karoune. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing interests. ADDITIONAL INFORMATION PUBLISHER’S NOTE Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations. SUPPLEMENTARY INFORMATION
* Survey Questions and Response Options
* Granular Recruitment Information
* Participant Metadata Dictionary
* Audio Metadata Dictionary
RIGHTS AND PERMISSIONS OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons
licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a
credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. ABOUT
THIS ARTICLE CITE THIS ARTICLE Budd, J., Baker, K., Karoune, E. _et al._ A large-scale and PCR-referenced vocal audio dataset for COVID-19. _Sci Data_ 11, 700 (2024).
https://doi.org/10.1038/s41597-024-03492-w * Received: 22 February 2024 * Accepted: 10 June 2024 * Published: 27 June 2024 * DOI: https://doi.org/10.1038/s41597-024-03492-w