I. Anjos, M. Eskenazi, N. Marques, M. Grilo, I. Guimarães, J. Magalhães, S. Cavaco, “Detection of voicing and place of articulation of fricatives with deep learning in a virtual speech and language therapy tutor”, to appear in Proceedings of Interspeech, 2020

I. Anjos, N. Marques, M. Grilo, I. Guimarães, J. Magalhães, S. Cavaco, “Sibilant consonants classification comparison with multi and single-class neural networks”, to appear in Expert Systems, Wiley-Blackwell Publishing Ltd, 2020

M. Grilo, I. Guimarães, M. Ascensão, A. Abad, I. Anjos, J. Magalhães and S. Cavaco, The BioVisualSpeech European Portuguese sibilants corpus, in Quaresma P., Vieira R., Aluísio S., Moniz H., Batista F., Gonçalves T. (eds) International Conference on Computational Processing of the Portuguese Language (PROPOR 2020). Lecture Notes in Computer Science, vol 12037, pages 23-33. Springer International Publishing, 2020.

The development of reliable speech therapy computer tools that automatically classify speech productions depends on the quality of the speech data set used to train the classification algorithms. The data set should characterize the population in terms of age, gender and native language, but it should also have other important properties that characterize the population that is going to use the tool. Thus, apart from samples of correct speech productions, it should also include samples from people with speech disorders, and its annotation should indicate whether each phoneme is correctly or incorrectly pronounced. Here, we present a corpus of European Portuguese children's speech data that we are using to develop speech classifiers for speech therapy tools for Portuguese children. The corpus includes data from children with speech disorders, and its labelling includes information about the speech production errors. The corpus, which contains data from 356 children from 5 to 9 years of age, focuses on the European Portuguese sibilant consonants and can be used to train speech recognition models for tools that assist the detection and treatment of sigmatism.

I. Guimarães, M. Ascensão, M. Grilo, Speech sounds data for typically developing European Portuguese children 6-9 years old, to appear in Proceedings of International Congress of Phonetic Sciences (ICPhS), 2019

Purposes: To identify European Portuguese (EP) speech sound competence in children. Methods: A total of 240 children between 6 and 9;11 years old named 37 pictures. Gender and age effects as well as the age limit for EP speech sound mastery were analyzed. The percentage of consonants correct (PCC) was determined. The criteria used were PCC ≥75% (acquired sound) and ≥90% (mastered sound). Results: No gender effect on speech sound development was found in the studied age range. Children in the older age range [8-9;11] showed a slightly but significantly higher mean performance than those in the younger range [6-7;11]. The girls appeared to reach higher mean competence than boys; however, the gender effect did not reach significance. At the [6-6;11] year old age range all plosives (except the word-medial /t/ and /g/), four fricatives (/f/, /v/, word-initial /ʃ/ and word-medial /ʒ/) and two laterals (word-medial /r/ and word-initial and word-medial /R/) are mastered. The other targeted sounds are mastered either at the [7-7;11] or at the [8-8;11] year old range. Conclusion: The EP targeted speech sounds are mastered between 6 and 8;11 years old.

V. Lopes, J. Magalhães, S. Cavaco, Sustained Vowel Game: A Computer Therapy Game for Children with Dysphonia, in Proceedings of Interspeech, 2019

Problems in vocal quality are common in 4 to 12-year-old children, which may affect their health as well as their social interactions and development process. The sustained vowel exercise is widely used by speech and language pathologists for the child's voice recovery and vocal re-education. Nonetheless, despite being an important voice exercise, it can be a monotonous and tedious activity for children. Here, we propose a computer therapy game that uses the sustained vowel exercise to motivate children to do this exercise often. In addition, the game gives visual feedback on the child's performance, which helps the child understand how to improve the voice production. The game uses a vowel classification model learned with a support vector machine and Mel frequency cepstral coefficients. A user test with 14 children showed that when using the game, children achieve longer phonation times than without the game. It also shows that the visual feedback helps and motivates children to improve their sustained vowel productions.

I. Anjos, N. Marques, M. Grilo, I. Guimarães, J. Magalhães, S. Cavaco, Sibilant consonants classification with deep neural networks, in Proceedings of the 19th EPIA Conference on Artificial Intelligence (EPIA), 2019

Many children suffering from speech sound disorders cannot pronounce the sibilant consonants correctly. We have developed a serious game that is controlled by the children’s voices in real time and that allows children to practice the European Portuguese sibilant consonants. For this, the game uses a sibilant consonant classifier. Since the game does not require any type of adult supervision, children can practice the production of these sounds more often, which may lead to faster improvements of their speech. Recently, the use of deep neural networks has given considerable improvements in classification for a variety of use cases, from image classification to speech and language processing. Here we propose to use deep convolutional neural networks to classify sibilant phonemes of European Portuguese in our serious game for speech and language therapy. We compared the performance of several different artificial neural networks that used Mel frequency cepstral coefficients or log Mel filterbanks. Our best deep learning model achieves classification scores of 95.48% using a 2D convolutional model with log Mel filterbanks as input features.
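As a generic illustration of the log Mel filterbank features mentioned in the abstract above (not the authors' implementation; the sample rate, frame, hop, and band counts are arbitrary choices), such features can be computed in plain NumPy:

```python
import numpy as np

def hz_to_mel(f):
    """Convert Hz to mels (O'Shaughnessy formula)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Frame the signal, take magnitude FFTs, and apply a triangular mel filterbank."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # (n_frames, n_fft//2+1)

    # Triangular filters equally spaced on the mel scale, 0 Hz to Nyquist
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    return np.log(power @ fbank.T + 1e-10)            # (n_frames, n_mels)

# Example: one second of noise stands in for a sibilant recording
x = np.random.default_rng(0).standard_normal(16000)
feats = log_mel_spectrogram(x)
print(feats.shape)  # (97, 40): one 40-band log-mel vector per 10 ms frame
```

The resulting time-by-band matrix is the kind of 2D input a convolutional classifier consumes.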

V. Lopes, J. Magalhães, and S. Cavaco, A dynamic difficulty adjustment model for dysphonia therapy games, in Proceedings of 3rd International Conference on Human Computer Interaction Theory and Applications (HUCAPP), pages 137–144, 2019.

Studies on childhood dysphonia have revealed considerable rates of voice disorders in 4–12-year-old children. The sustained vowel exercise is widely used as a technique in the vocal (re)education process. However, this exercise can become tedious after a short practice. Here, we propose a novel dynamic difficulty adjustment model to be used in a serious game with the sustained vowel exercise to motivate children to practice this exercise often. The model automatically adapts the difficulty of the challenges in response to the child’s performance. The model is not exclusive to this game and can be used in other games for dysphonia treatment. In order to measure the child’s performance, the model uses parameters that are relevant to the therapy treatment. The proposed model is based on the flow model in order to balance the difficulty of the challenges with the child’s skills.

I. Anjos, M. Grilo, M. Ascensão, I. Guimarães, J. Magalhães, S. Cavaco, A Model for Sibilant Distortion Detection in Children, in Proceedings of the 2018 International Conference on Digital Medicine and Image Processing (DMIP), 2018.

The distortion of sibilant sounds is a common type of speech sound disorder in European Portuguese speaking children. Speech and language pathologists (SLP) use different types of speech production tasks to assess these distortions. One of these tasks consists of the sustained production of isolated sibilants. Using these sound productions, SLPs usually rely on auditory perceptual evaluation to assess the sibilant distortions. Here we propose to use an isolated sibilant machine learning model to help SLPs assess these distortions. Our model uses Mel frequency cepstral coefficients of the isolated sibilant phones from 145 children, and was trained using support vector machines. The analysis of the false negatives detected by the model can give insight into whether the child has a sibilant production distortion. We were able to confirm that there exists a relation between the model classification results and the distortion assessment of professional SLPs. Approximately 66% of the distortion cases identified by the model are confirmed by an SLP as having some sort of distortion or are perceived as being the production of a different sound.

I. Anjos, M. Grilo, M. Ascensão, I. Guimarães, J. Magalhães, S. Cavaco, A serious mobile game with visual feedback for training sibilant consonants, in A. D. Cheok, M. Inami, and T. Romão (eds), Advances in Computer Entertainment Technology, pages 430–450. ACE 2017. Lecture Notes in Computer Science, vol 10714, Springer International Publishing, 2018.

The distortion of sibilant sounds is a common type of speech sound disorder (SSD) in Portuguese speaking children. Speech and language pathologists (SLP) frequently use the isolated sibilants exercise to assess and treat this type of speech error. While technological solutions like serious games can help SLPs motivate the children to do the exercises repeatedly, there is a lack of such games for this specific exercise. Another important aspect is that given the usual small number of therapy sessions per week, children are not improving at their maximum rate, which is only achieved by more intensive therapy. We propose a serious game for mobile platforms that allows children to practice their isolated sibilants exercises at home to correct sibilant distortions. This will allow children to practice their exercises more frequently, which can lead to faster improvements. The game, which uses an automatic speech recognition (ASR) system to classify the child’s sibilant productions, is controlled by the child’s voice in real time and gives immediate visual feedback to the child about her sibilant productions. In order to keep the computation on the mobile platform as simple as possible, the game has a client-server architecture, in which the external server runs the ASR system. We trained it using raw Mel frequency cepstral coefficients, and we achieved very good results with an accuracy test score of above 91% using support vector machines.
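Mel frequency cepstral coefficients, used by several of the classifiers above, are obtained by taking a discrete cosine transform over the log mel bands. A minimal NumPy sketch of that last step, with a random matrix standing in for real log-mel features (an illustration of the standard technique, not any specific system here):

```python
import numpy as np

def mfcc_from_log_mel(log_mel, n_ceps=13):
    """Orthonormal DCT-II across the mel bands yields cepstral coefficients."""
    n_mels = log_mel.shape[1]
    n = np.arange(n_mels)
    # One DCT basis row per cepstral coefficient
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    basis *= np.sqrt(2.0 / n_mels)
    basis[0] /= np.sqrt(2.0)          # orthonormal scaling for coefficient 0
    return log_mel @ basis.T          # (n_frames, n_ceps)

# Stand-in for a (frames x mel-bands) log-mel matrix
log_mel = np.random.default_rng(1).standard_normal((97, 40))
mfcc = mfcc_from_log_mel(log_mel)
print(mfcc.shape)  # (97, 13)
```

Per-frame vectors like these (often averaged or stacked over an utterance) are what an SVM classifier would then be trained on.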

S. Rallabandi, B. Karki, C. Viegas, E. Nyberg, A. W. Black, Investigating Utterance Level Representations for Detecting Intent from Acoustics, in Proc. Interspeech 2018, ISCA, pages 516-520, Hyderabad, India, September 2018

Recognizing paralinguistic cues from speech has applications in varied domains of speech processing. In this paper we present approaches to identify the expressed intent from acoustics in the context of the INTERSPEECH 2018 ComParE challenge. We made submissions in three sub-challenges: the prediction of 1) self-assessed affect and 2) atypical affect, and 3) the Crying sub-challenge. Since emotion and intent are perceived at suprasegmental levels, we explore a variety of utterance level embeddings. The work includes experiments with both automatically derived as well as knowledge-inspired features that capture spoken intent at various acoustic levels. The incorporation of utterance level embeddings at the text level using an off-the-shelf phone decoder has also been investigated. The experiments impose constraints and manipulate the training procedure using heuristics from the data distribution. We conclude by presenting preliminary results on the development and blind test sets.

F. Teixeira, A. Abad, I. Trancoso, Patient Privacy in Paralinguistic Tasks, in Proc. Interspeech 2018, ISCA, pages 3428-3432, Hyderabad, India, September 2018

Recent developments in cryptography and, in particular, in Fully Homomorphic Encryption (FHE) have allowed for the development of new privacy preserving machine learning schemes. In this paper, we show how these schemes can be applied to the automatic assessment of speech affected by medical conditions, allowing for patient privacy in diagnosis and monitoring scenarios. More specifically, we present results for the assessment of the degree of Parkinson's disease, the detection of a cold, and both the detection and assessment of the degree of depression. To this end, we use a neural network in which all operations are performed in an FHE context. This implies replacing the activation functions by linear and second degree polynomials, as only additions and multiplications are viable. Furthermore, to guarantee that the inputs of these activation functions fall within the convergence interval of the approximation, a batch normalization layer is introduced before each activation function. After training the network with unencrypted data, the resulting model is then employed in an encrypted version of the network, to produce encrypted predictions. Our tests show that the use of this framework yields results with little to no performance degradation, in comparison to the baselines produced for the same datasets.
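The key idea in the abstract above, polynomial activations preceded by batch normalization so inputs stay inside the approximation's valid interval, can be sketched in plain NumPy without any actual encryption (all weights and polynomial coefficients below are illustrative placeholders, not the paper's trained values):

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(x, eps=1e-5):
    """Normalize each feature to zero mean / unit variance, keeping
    activation inputs inside the polynomial's convergence interval."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def poly_act(x):
    """Degree-2 polynomial stand-in for a sigmoid-like activation:
    only additions and multiplications, so it can be evaluated under FHE."""
    return 0.5 + 0.25 * x + 0.05 * x ** 2   # illustrative coefficients

# Toy two-layer forward pass with random placeholder weights
X = rng.standard_normal((8, 10))             # batch of 8 feature vectors
W1 = rng.standard_normal((10, 6))
W2 = rng.standard_normal((6, 1))
h = poly_act(batch_norm(X @ W1))             # BN before each activation
y = h @ W2                                   # linear output layer
print(y.shape)  # (8, 1)
```

Every operation here reduces to additions and multiplications, which is the property that lets the trained model run on homomorphically encrypted inputs.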

A. Pompili, A. Abad, D. Martins de Matos, I. Pavão Martins, Topic coherence analysis for the classification of Alzheimer’s disease, in IberSPEECH 2018, ISCA, pages 281-285, doi: 10.21437/IberSPEECH.2018-59, Barcelona, November 2018

Language impairment in Alzheimer’s disease is characterized by a decline in the semantic and pragmatic levels of language processing that manifests from the early stages of the disease. While semantic deficits have been widely investigated using linguistic features, pragmatic deficits are still mostly unexplored. In this work, we present an approach to automatically classify Alzheimer’s disease using a set of pragmatic features extracted from a discourse production task. Following clinical practice, we consider an image representing a closed domain as the discourse elicitation form. Then, we model the elicited speech as a graph that encodes a hierarchy of topics. To do so, the proposed method relies on the integration of various NLP techniques: syntactic parsing for sentence segmentation into clauses, coreference resolution for capturing dependencies among clauses, and word embeddings for identifying semantic relations among topics. According to the experimental results, pragmatic features are able to provide promising results in distinguishing individuals with Alzheimer’s disease, comparable to solutions based on other types of linguistic features.

C. Viegas, S.-H. Lau, R. Maxion, A. Hauptmann, Towards Independent Stress Detection: a Dependent Model using Facial Action Units, International Conference on Content-Based Multimedia Indexing (CBMI), 2018

Our society is increasingly susceptible to chronic stress. Reasons include daily worries, workload, and the wish to fulfil a myriad of expectations. Unfortunately, long exposure to stress leads to physical and mental health problems. To avoid these consequences, mobile applications have been studied to track stress in combination with wearables. However, wearables need to be worn all day long and can be costly. Given that most laptops have inbuilt cameras, using video data for personal tracking of stress levels could be a more affordable alternative. In previous work, videos have been used to detect cognitive stress during driving by measuring the presence of anger or fear through a limited number of facial expressions. In contrast, we propose the use of 17 facial action units (AUs) not solely restricted to those emotions. We used five one-hour long videos from the dataset collected by Lau [1]. The videos show subjects while typing, resting, and exposed to a stressor, namely a multitasking exercise combined with social evaluation. We performed binary classification using several simple classifiers on AUs extracted in each video frame and were able to achieve an accuracy of up to 74% in subject independent classification and 91% in subject dependent classification. These preliminary results indicate that the AUs most relevant for stress detection are not consistently the same for all 5 subjects. This is in line with previous work using facial cues, which also found a strong person-specific component in classification.

C. Viegas, S.-H. Lau, R. Maxion, A. Hauptmann, Distinction of Stress and Non-Stress Tasks using Facial Action Units, in Proceedings of the 20th International Conference on Multimodal Interaction (ICMI), 2018

Long exposure to stress is known to lead to physical and mental health problems. But how can we as individuals track and monitor our stress? Wearables which measure heart rate variability have been studied to detect stress. Such devices, however, need to be worn all day long and can be expensive. As an alternative, we propose the use of frontal face videos to distinguish between stressful and non-stressful activities. Affordable personal tracking of stress levels could be obtained by analyzing the video stream of inbuilt cameras in laptops. In this work, we present a preliminary analysis of 114 one-hour long videos. During the video, the subjects perform a typing exercise before and after being exposed to a stressor. We performed a binary classification using Random Forest (RF) to distinguish between stressful and non-stressful activities. As features, facial action units (AUs) extracted from each video frame were used. We obtained an average accuracy of over 97% and 50% for subject dependent and subject independent classification, respectively.

A. Grossinho, J. Magalhães, S. Cavaco, Visual-feedback in an interactive environment for speech-language therapy, in Proceedings of the Workshop on Child Computer Interaction (WOCCI) of the ACM International Conference on Multimodal Interaction, 2017.

By combining visual feedback and motivational elements, a computer-based speech therapy system can offer new approaches with various advantages when compared to traditional speech therapy techniques. Through visual feedback and the adaptation of traditional speech sound exercises, it is possible to create an engaging environment with motivation-focused elements. These elements can be used in an interactive environment that motivates the therapy attendee towards better performances. Here we present an interactive gamified environment for speech therapy that combines visual feedback and motivational components. The results from a survey and a usability study suggest that children show more interest in the speech therapy sessions when the proposed environment is used.

R. Carrapiço, I. Guimarães, M. Grilo, S. Cavaco, J. Magalhães, 3D Facial Video Retrieval and Management for Decision Support in Speech and Language Therapy, in Proceedings of ACM International Conference on Multimedia Retrieval (ICMR), 2017.

3D video is introducing great changes in many health related areas. The realism of such information provides health professionals with strong evidence analysis tools to facilitate clinical decision processes. Speech and language therapy aims to help subjects in correcting several disorders. The assessment of the patient by the speech and language therapist (SLT) requires several visual and audio analysis procedures that can interfere with the patient’s production of speech. In this context, the main contribution of this paper is a 3D video system to improve health information management processes in speech and language therapy. The 3D video retrieval and management system supports multimodal health records and provides the SLTs with tools to support their work in many ways: (i) it allows SLTs to easily maintain a database of patients’ orofacial and speech exercises; (ii) it supports three-dimensional orofacial measurement and analysis in a non-intrusive way; and (iii) it allows searching patient speech exercises by similar facial characteristics, using facial image analysis techniques. The second contribution is a dataset with 3D videos of patients performing orofacial speech exercises. The whole system was evaluated successfully in a user study involving 22 SLTs. The user study illustrated the importance of the retrieval by similar orofacial speech exercise.

C. Pedrosa, I. Guimarães, Contributo para o estudo da fidedignidade do uso do paquímetro na antropometria facial em adultos [Contribution to the study of the reliability of caliper use in facial anthropometry in adults], Revista Portuguesa de Terapia da Fala (RPTF), vol 5 (I), pages 16-22, 2016.

The aim of this study is to verify whether the measurements resulting from facial anthropometric assessment in adults using a caliper show reproducibility and repeatability. Methods: Four adult subjects underwent direct facial anthropometric assessment (eight measurements) with a caliper. The assessment took place at two moments, 42 days apart, with nine examiners at the first moment and 16 at the second. Inter-examiner reliability (reproducibility) was determined with Cronbach's alpha, and intra-examiner reliability (repeatability) with Spearman's rho correlation coefficient. Results: Inter-examiner reliability (reproducibility) is reasonable (α=0.7-0.8) for 78% and 93.7% of the measurements at the first and second moment, respectively. Intra-examiner reliability (repeatability) did not reach statistical significance for any measurement except the middle third of the face (rs=0.83, p<0.05). Conclusion: Facial anthropometry with a digital caliper is a technique with reasonable reproducibility, but its repeatability was not robust in the present study.

M. Lopes, J. Magalhães, S. Cavaco, A voice-controlled serious game for the sustained vowel exercise, in Proceedings of Advances in Computer Entertainment Technology Conference (ACE), 2016.

Speech is the main form of human communication. Thus, it is important to detect and treat speech sound disorders as early as possible during childhood. When children need to attend speech therapy, it is critical to keep them motivated to do the therapy exercises. Software systems for speech therapy can be a useful tool to keep the child interested in practicing the therapy exercises. Several software systems have been developed to assist speech and language therapists during therapy sessions. However, most software focuses on articulation disorders, while voice disorders have been mostly neglected. Here we propose a voice-controlled serious computer game for the sustained vowel exercise, which is an exercise commonly used in speech therapy to treat voice disorders. The main novelty of this application is the combination of real-time speech processing with the gamification of speech therapy exercises and the parameterization of the difficulty level.

A. Grossinho, I. Guimarães, J. Magalhães, S. Cavaco, Robust phoneme recognition for a speech therapy environment, in Proceedings of IEEE International Conference on Serious Games and Applications for Health (SeGAH), 2016.

Traditional speech therapy approaches for speech sound disorders have much to gain from computer-based therapy systems. In this paper, we propose a robust phoneme recognition solution for an interactive environment for speech therapy. With speech recognition techniques, the motivation elements of computer-based therapy systems can be automated in order to obtain an interactive environment that motivates the therapy attendee towards better performances. The contribution of this paper is a robust phoneme recognizer that controls the feedback provided to the patient during a speech therapy session. We compare the results of hierarchical and flat classification, with naive Bayes, support vector machines and kernel density estimation on linear predictive coding coefficients and Mel-frequency cepstral coefficients.

M. Diogo, M. Eskenazi, J. Magalhães, S. Cavaco, Robust Scoring Of Voice Exercises In Computer-Based Speech Therapy Systems, European Signal Processing Conference (EUSIPCO), 2016.

Speech therapy is essential to help children with speech sound disorders. While some computer tools for speech therapy have been proposed, most focus on articulation disorders. Another important aspect of speech therapy is voice quality, but not much research has been developed on this issue. As a contribution to fill this gap, we propose a robust scoring model for voice exercises often used in speech therapy sessions, namely the sustained vowel and the increasing/decreasing pitch variation exercises. The models are learned with a support vector machine and double cross validation, and obtained accuracies from approximately 73.98% to 85.93% while showing a low rate of false negatives. The learned models allow classifying the children's answers on the exercises, thus providing them with real-time feedback on their performance.

R. Carrapiço, A. Mourão, J. Magalhães, S. Cavaco, A comparison of thermal image descriptors for face analysis, in Proceedings of the European Signal Processing Conference (EUSIPCO), 2015.

Thermal imaging is a type of imaging that uses thermographic cameras to detect radiation in the infrared range of the electromagnetic spectrum. Thermal images are particularly well suited for face detection and recognition because of their low sensitivity to illumination changes, skin color, beards and other artifacts. In this paper, we take a fresh look at the problem of face analysis in the thermal domain. We consider several thermal image descriptors and assess their performance in two popular tasks: face recognition and facial expression recognition. The results have shown that face recognition can reach accuracy levels of 91% with Local Binary Patterns. Also, despite the difficulty of facial expression detection, our experiments have revealed that Haar-based features (FCTH - Fuzzy Color and Texture Histogram) offer the best results for some facial expressions.

A. Grossinho, S. Cavaco, J. Magalhães, An interactive toolset for speech therapy, in Proceedings of Advances in Computer Entertainment Technology Conference (ACE), 2014.

This paper proposes a novel approach to include biofeedback in speech and language therapy by providing the patient with visual self-monitoring of his/her performance combined with a reward mechanism in an entertainment environment. We propose a toolset that includes an in-session interactive environment to be used during the therapy sessions. This in-session environment provides instantaneous biofeedback and assists the therapist during the session with rewards for the patient’s good performance. It also allows audio-visual recordings and annotations of the session to be made for later analysis. The toolset also provides an off-line multimedia application for post-session analysis, where the session audio-visual recordings can be examined through browsing, searching, and visualization techniques to plan future sessions.


  • M. Lopes, J. Magalhães, S. Cavaco, A voice therapy serious game with difficulty level adaptation, ACM WomENcourage, 2017.
  • BioVisualSpeech - serious games for speech therapy sessions and intensive training, CMU-Portugal Symposium, 2017.
  • BioVisualSpeech - NovaSpeech, CMU-Portugal Symposium, 2017.
  • Carla Viegas, Multimodal Analysis of the Interaction between Motor Speech Disorders and Expressed Emotions Using Machine Learning Techniques, CMU-Portugal Symposium, 2017.
  • Carla Viegas, BioVisualSpeech - a multimodal framework to support speech therapy, Innovation Research Lab Exhibition, Medical Valley Center Erlangen, July 2016.



  • Thomas Rolland and Alberto Abad, Transfer learning in Automatic Speech Recognition for the adaptation of an adult acoustic model to a child acoustic model.
  • Mauro Sansana, ASR Experiment.
  • Hugo Cardoso, A speech therapy game, (report about serious games for childhood apraxia of speech), IST.
  • Vanessa Lopes, Jogo sério com exercícios com a vogal sustentada para a terapia da fala, (master dissertation proposal), FCT.UNL.
  • Carla Viegas, Multimodal Analysis of the Interaction between Motor Speech Disorders and Expressed Emotions Using Machine Learning Techniques (Ph.D. proposal), FCT.UNL.
  • Pedro Ferreira, Automatic sound analysis to improve speech and language therapy, (report on the analysis of diadochokinetics, master dissertation proposal), FCT.UNL.
  • Ivo Anjos, Serious mobile games with fricative consonant exercises for speech therapy, (master dissertation proposal), FCT.UNL.