About this page
This page provides details of the summary pages which have been created for many of the recordings of talks held on Zoom.
About the Zoom talks
A series of online talks, workshops and panel sessions was launched shortly after the Covid lockdown prevented sides from practising and dancing.
The talks were held on Zoom. Recordings were made of many of the talks and, following checks by the speakers, the videos were published on YouTube.
At a later date it was felt desirable to add captions to the recordings, using YouTube’s automated captioning tool.
About the summaries
During 2024 it was realised that the automated captions could be fed into summarising software in order to create short summaries of the talks.
Talks typically lasted 1-2 hours. It was felt that a 500-word summary would provide a useful overview of each talk and help potential viewers of the recordings to decide whether the talk would be of interest to them.
Once the summaries had been made we attempted to contact the speakers in order to receive feedback on the accuracy of the content. We are pleased to say that the summaries were generally felt to be accurate, with only a small number of mostly minor tweaks to the content requested.
Summary of technologies used
Generation of text from speech
The source material was the YouTube recordings of the talks delivered on Zoom.
Initially the automated captions provided by YouTube were published. These provide a useful tool for viewers who have hearing disabilities, or who find captions helpful for other reasons such as watching videos in a noisy environment.
Only a small number of edits were made to the captions, such as the occasional deletion of ‘ums’ and ‘errs’.
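As an aside, the auto-generated captions can be retrieved programmatically as a plain-text transcript. The sketch below is purely illustrative and assumes the third-party youtube-transcript-api Python package (not necessarily what was used here) and a placeholder video ID; the exact call may vary between versions of the package.

```python
# Minimal sketch: download YouTube's auto-generated captions as plain text.
# Assumes `pip install youtube-transcript-api`; "VIDEO_ID" is a placeholder.
from youtube_transcript_api import YouTubeTranscriptApi

# Returns a list of caption segments, each with 'text', 'start' and 'duration'.
segments = YouTubeTranscriptApi.get_transcript("VIDEO_ID", languages=["en-GB", "en"])

# Join the segments into a single transcript suitable for summarising.
transcript = " ".join(segment["text"] for segment in segments)

with open("talk_transcript.txt", "w") as f:
    f.write(transcript)
```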
It seems that by default YouTube uses a number of AI techniques to provide the captions (an illustrative speech-to-text sketch follows the list):
- Google Cloud Speech-to-Text is at the core of YouTube’s automatic captioning system.
- Automatic Speech Recognition (ASR) identifies spoken words and converts them into text.
- It uses machine learning models trained on vast datasets of spoken language, including various accents and dialects.
- YouTube uses Google’s proprietary NLP models, including BERT (Bidirectional Encoder Representations from Transformers). NLP helps the system handle punctuation, grammar, and context, improving the accuracy of captions.
- Speaker Diarisation may be used to differentiate between different speakers in videos.
- Google’s Time Synchronisation Algorithms ensure that captions are synchronised with spoken words by aligning detected phonemes with video timestamps.
- Deep Neural Networks (DNNs) are used to continuously improve accuracy by learning from user interactions (e.g., manual caption edits, corrections, and user feedback).
- YouTube’s captioning uses context-aware models to distinguish homophones and apply correct words based on sentence context.
- YouTube may apply Content-Specific Language Models for certain content types, especially when dealing with niche terminology or technical discussions.
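The details of YouTube’s internal pipeline are not public, but the first item in the list, Google Cloud Speech-to-Text, is a service that can be called directly. The following minimal sketch, assuming audio from a talk has been uploaded to a hypothetical Cloud Storage bucket, illustrates that kind of automatic speech recognition; it is not a description of YouTube’s own captioning system.

```python
# Illustrative sketch of automatic speech recognition using the Google Cloud
# Speech-to-Text client library (not YouTube's internal pipeline).
# Assumes `pip install google-cloud-speech` and that the audio for a talk has
# been uploaded to a hypothetical Cloud Storage bucket.
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    language_code="en-GB",
    enable_automatic_punctuation=True,  # punctuation added by the NLP models
)
audio = speech.RecognitionAudio(uri="gs://example-bucket/talk-audio.flac")

# Talks of 1-2 hours are long-running jobs rather than short synchronous requests.
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=3600)

for result in response.results:
    # Each result holds one or more alternatives; the first is the most likely.
    print(result.alternatives[0].transcript)
```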
Creation of summaries
The transcripts were then fed into the ChatGPT-4o AI tool using a simple prompt:
Give a 500 word summary of this transcript of a talk on “xxx” by xxx which was organised by the Morris Federation and held on Zoom on xx mmm 202n.
Initially it was expected that the prompt would have to be refined in order to give an accurate summary of talks which typically feature regional accents and specialist morris-related vocabulary. However, following feedback from the speakers it was found that this was not the case: the only significant misunderstanding came from the speech-to-text conversion for one talk, in which the word “maypole” (as in maypole dancing) was rendered as “maple”!
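For anyone wishing to script this step rather than paste transcripts into ChatGPT by hand, the following minimal sketch shows how the same prompt could be applied programmatically. It assumes the OpenAI Python SDK, the gpt-4o model, and a transcript saved in a hypothetical local file; it illustrates the approach rather than the exact tool used.

```python
# Minimal sketch of scripting the summarisation step with the OpenAI Python SDK.
# Assumes `pip install openai`, an OPENAI_API_KEY environment variable, and a
# transcript saved as "talk_transcript.txt" (a hypothetical filename).
from openai import OpenAI

client = OpenAI()

with open("talk_transcript.txt") as f:
    transcript = f.read()

prompt = (
    'Give a 500 word summary of this transcript of a talk on "xxx" by xxx '
    "which was organised by the Morris Federation and held on Zoom "
    "on xx mmm 202n.\n\n" + transcript
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```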
Author: Brian Kelly
Created: 15 Feb 2025