Tutorials

Monday, 17 October 2016

The ISCSLP 2016 Organising Committee is pleased to announce the following seven tutorials, presented by distinguished speakers, to be offered on Monday, 17 October 2016. All tutorials are of two-hour duration, and registration is free for ISCSLP 2016 delegates.

The tutorial handouts will be provided electronically, ahead of the tutorials. Please download and print at your convenience, as we will not be providing hard copies of these at the conference.

Tutorials

9:30 - 11:30

ISCSLP-T1

Deep Learning for Statistical Parametric Speech Synthesis
- Zhen-Hua Ling

ISCSLP-T2

Speech front-end processing under multi-sources reverberant acoustic environments
- Qiang Fu, Xiaofei Wang

ISCSLP-T3

Techniques & Applications For Speech Interaction Between Human And Cloud Robot
- Min Chu, Zhijie Yan, Jian Sun, Yining Chen

 

13:30 - 15:30

ISCSLP-T4

Deep Learning: Recent Advances and Moving Forward
- Dong Yu

ISCSLP-T5

Undirected Graphical Models: Theory and Applications to Speech and Language Processing
- Zhijian Ou

 

16:00 - 18:00

ISCSLP-T6

Emotion Recognition in Speech, Text and Conversational Data
- Junlan Feng, Chaomin Wang, Yanmeng Wang

ISCSLP-T7

Automatic Speaker Verification: State of the Art, Spoofing and Countermeasures
- Zhizheng Wu, Haizhou Li



ISCSLP-T1

Title: Deep Learning for Statistical Parametric Speech Synthesis
Presenter: Zhen-Hua Ling

Abstract: Since 2006, deep learning has emerged as a new area of machine learning research and has attracted the attention of many signal processing researchers. Both generative deep architectures (e.g., restricted Boltzmann machines (RBMs) and deep belief networks (DBNs)) and discriminative deep architectures (e.g., deep auto-encoders (DAEs), deep neural networks (DNNs), and recurrent neural networks with long short-term memory (LSTM-RNNs)) have been intensively studied and explored by signal processing researchers in recent years. After the successful application of DNNs to the acoustic modeling of automatic speech recognition (ASR), deep learning techniques have also been applied to statistical parametric speech synthesis (SPSS) to deal with the limitations of conventional approaches. In this tutorial, I will first review the conventional framework of SPSS, which uses Gaussian-HMMs for acoustic modeling. Then I will introduce the key techniques of deep learning, including RBMs, DBNs, DNNs, and LSTM-RNNs. Their model structures and training algorithms will be explained, especially those aspects specific to speech synthesis as distinct from other applications such as speech recognition. Next, the various implementations of deep learning methods for statistical parametric speech synthesis will be reviewed, including deep learning based feature representation, deep learning based acoustic modeling and deep learning based post-filtering. The acoustic modeling methods using deep learning techniques will be the emphasis of this tutorial. I will also discuss the differences between the deep learning techniques used for ASR and these deep learning based speech synthesis methods, to provide insights into the technical challenges encountered by different applications of deep learning.
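To make the DNN-based acoustic modeling idea concrete, the sketch below maps frame-level linguistic feature vectors to acoustic feature vectors with a feedforward network trained under a mean-squared-error criterion. It is only an illustration of the general approach discussed in the tutorial; the feature dimensions, network sizes and training details are assumptions, not the presenter's implementation.

# Minimal sketch of a DNN acoustic model for statistical parametric speech
# synthesis: frame-level linguistic features -> acoustic features.
# Dimensions and hyper-parameters below are illustrative assumptions.
import torch
import torch.nn as nn

LING_DIM = 425   # assumed size of the linguistic feature vector per frame
ACOUS_DIM = 187  # assumed size of the acoustic feature vector per frame

class DNNAcousticModel(nn.Module):
    def __init__(self, hidden=1024, layers=4):
        super().__init__()
        blocks, d = [], LING_DIM
        for _ in range(layers):
            blocks += [nn.Linear(d, hidden), nn.Tanh()]
            d = hidden
        blocks.append(nn.Linear(d, ACOUS_DIM))   # linear output for regression
        self.net = nn.Sequential(*blocks)

    def forward(self, x):            # x: (num_frames, LING_DIM)
        return self.net(x)           # (num_frames, ACOUS_DIM)

model = DNNAcousticModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()               # frame-wise regression criterion

# One illustrative training step on random stand-in data.
ling = torch.randn(500, LING_DIM)    # 500 frames of linguistic features
acous = torch.randn(500, ACOUS_DIM)  # matching acoustic feature targets
optimizer.zero_grad()
loss = loss_fn(model(ling), acous)
loss.backward()
optimizer.step()

In a real SPSS system the inputs would typically be derived from context-dependent linguistic labels and the outputs would be vocoder parameters, with duration modeled separately.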

Biography: Zhen-Hua Ling received the B.E. degree in electronic information engineering and the M.S. and Ph.D. degrees in signal and information processing from the University of Science and Technology of China, Hefei, China, in 2002, 2005, and 2008, respectively. From October 2007 to March 2008, he was a Marie Curie Fellow at the Centre for Speech Technology Research (CSTR), University of Edinburgh, UK. From July 2008 to February 2011, he was a joint Postdoctoral Researcher at the University of Science and Technology of China and iFLYTEK Co., Ltd., China. He is currently an Associate Professor at the University of Science and Technology of China. He also worked at the University of Washington, USA, as a Visiting Scholar from August 2012 to August 2013. His research interests include speech processing, speech synthesis, voice conversion, speech analysis, and speech coding. He received the IEEE Signal Processing Society Young Author Best Paper Award in 2010 and is currently an associate editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[Back to Top]


ISCSLP-T2

Title: Speech front-end processing under multi-sources reverberant acoustic environments
Presenters: Qiang Fu, Xiaofei Wang, Institute of Acoustics, Chinese Academy of Sciences.
Email: fuqiang@mail.ioa.ac.cn, wangxiaofei@hccl.ioa.ac.cn
Tel: 13436590729, 13522669165

Tutorial description: Speech front-end processing aims at obtaining clean speech (as close as possible to close-talking speech) in arbitrary scenarios, including both close-talking and distant-talking modes. In the distant-talking (far-field) mode, possible “contaminations” such as acoustic echoes, coherent interference, background noise and room reverberation add together and become more severe as the distance increases. In this tutorial, speech front-end processing methods for tackling these “contaminations” are categorized from the perspectives of physical modelling and data-driven modelling, which illustrates the history and methodology of the research area. The details of the tutorial are as follows.
Firstly, from the perspective of physical modelling, state-of-the-art single-channel and multi-channel front-end processing in the time, spectral and spatial domains is presented. In more detail, echo cancellation, target speech detection/voice activity detection, linear/nonlinear beamforming and dereverberation are all taken into consideration, resulting in a brief summary of the classical methods of the past two decades and the conclusions drawn from them. On this basis, several practical improvements of speech front-end processing for real applications are presented.
Secondly, from the perspective of data-driven modelling: in recent years, non-negative matrix factorization (NMF) has been widely used in many audio applications, such as source separation and speech enhancement. The clean speech spectrum is estimated as a linear combination of speech bases weighted by their corresponding activations. To tackle non-stationary noise, speech-like interference or multi-source scenarios, properties of speech such as temporal dependency and sparsity can be exploited. With limited training data, or with noise that has a low-rank spectral structure, NMF is able to improve the SNR significantly. The deep neural network (DNN) is another emerging data-driven technique in front-end processing, especially after its comprehensive success in speech recognition, image recognition and many other areas. Intuitive usages such as time-frequency classification and spectral mapping have shown promising results, and DNNs show more capability in tackling non-stationary noise, which has challenged traditional speech enhancement approaches for decades.
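As a rough illustration of the NMF idea described above, the sketch below factorizes a noisy magnitude spectrogram over a joint speech-plus-noise dictionary and keeps the speech part of the reconstruction. It is a simplified sketch, not the presenters' system; the basis sizes, iteration count and the assumption of pre-trained bases are illustrative.

# Minimal NMF-based enhancement sketch in the magnitude-spectrogram domain:
# V ~= B @ A with non-negative bases B and activations A.
import numpy as np

def nmf_activations(V, B, n_iter=200, eps=1e-12):
    """Estimate non-negative activations A so that V ~= B @ A, with the
    bases B held fixed (multiplicative updates for the Frobenius cost)."""
    A = np.random.rand(B.shape[1], V.shape[1])
    for _ in range(n_iter):
        A *= (B.T @ V) / (B.T @ (B @ A) + eps)
    return A

def enhance(V_noisy, B_speech, B_noise):
    """Wiener-like masking built from the speech part of the NMF reconstruction."""
    B = np.hstack([B_speech, B_noise])        # joint speech + noise dictionary
    A = nmf_activations(V_noisy, B)
    k = B_speech.shape[1]
    V_speech = B_speech @ A[:k]               # speech bases x speech activations
    mask = V_speech / (B @ A + 1e-12)         # soft mask in [0, 1]
    return mask * V_noisy                     # enhanced magnitude spectrogram

# Toy usage: random stand-ins for pre-trained bases and a noisy spectrogram
# (in practice B_speech / B_noise are learned offline from clean speech / noise).
B_speech = np.random.rand(257, 40)
B_noise = np.random.rand(257, 20)
V_noisy = np.random.rand(257, 300)
V_hat = enhance(V_noisy, B_speech, B_noise)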
However, data-driven modelling does not solve every problem, nor does it by itself help us learn more about speech. A DNN is inherently a supervised approach and only learns what it is taught to learn, whereas modeling how speech is produced and perceived has long been the work of signal processing. So we let the DNN do what it does best, namely learn prior information from training data, and let the signal processing part decide how to use that information. Some recent research updates are presented from this starting point, which combines traditional signal processing and machine learning.
Finally, a brief introduction to real applications is presented, demonstrating the effectiveness of speech front-end processing.

Significance and relevance: Speech front-end processing plays an important role in various applications, such as speech recognition and speech communication. For free-style human-computer speech interaction, this tutorial summarizes the essence of speech front-end processing and distills practical methods for multi-source reverberant acoustic environments.

Biography of presenter: Qiang Fu received the Ph.D. degree in electronic engineering from Xidian University, Xi'an, in 2000. In 2000, he worked as a Researcher at the Motorola China Research Center (MSRC), Shanghai, China. From 2001 to 2002, he was a Senior Research Associate at the Center for Spoken Language Understanding (CSLU), OGI School of Science and Engineering, Oregon Health & Science University, Oregon, USA. From 2002 to 2004, he was a Senior Postdoctoral Research Fellow in the Department of Electronic and Computer Engineering, University of Limerick, Ireland. He is currently a Professor at the Institute of Acoustics, Chinese Academy of Sciences, China. His research interests include speech analysis, microphone array processing, distant-talking speech recognition, audio-visual signal processing, and machine learning for signal processing. Dr. Qiang Fu is a member of the IEEE Signal Processing Society. He received a Chinese Academy of Sciences award for outstanding science and technology achievement in 2014.
Xiaofei Wang received the B.E. degree from Huazhong University of Science and Technology, Wuhan, in 2010 and the Ph.D. degree in Signal and Information Processing from the University of Chinese Academy of Sciences, Beijing, in 2015. Since July 2015, he has been an assistant professor at the Institute of Acoustics, Chinese Academy of Sciences, Beijing, China. His current research focuses on distant-talking speech recognition, speech enhancement and machine learning for signal processing.

Relevant publications from presenter(s):
2016
[1] Yueyue Na, Yanmeng Guo, Qiang Fu, and Yonghong Yan, "Cross Array and Rank-1 MUSIC Algorithm for Acoustic Highway Lane Detection," IEEE Transactions on Intelligent Transportation Systems, accepted, 2016.
[2] Chao Wu, Xiaofei Wang, Yanmeng Guo, Qiang Fu, and Yonghong Yan, "Robust Uncertainty Control of the Simplified Kalman Filter for Acoustic Echo Cancelation," Circuits, Systems, and Signal Processing, Feb. 2016.
2015
[1] Y. Na, Y. Guo, Q. Fu, and Y. Yan, "An Acoustic Traffic Monitoring System: Design and Implementation " presented at the 12th IEEE International Conference on Ubiquitous Intelligence and Computing (UIC2015), 2015.
[2] Chao Wu, et al., "Robust beamforming using beam-to-reference weighting diagonal loading and Bayesian framework," Electronics Letters, vol. 51, no. 22, pp. 1772-1774, 2015.
[3] C. Wu, X. Wang, Y. Guo, Q. Fu, and Y. Yan, "Robust Huber M-estimator based proportionate affine projection algorithm with variable cutoff updating," Electronics Letters, vol. 51, pp. 2113-2115, 2015.
[4] Xiaofei Wang, Yanmeng Guo, Fengpei Ge, Chao Wu, Qiang Fu, and Yonghong Yan, "Speech-picking for speech systems with auditory attention ability," Scientia Sinica Informationis, 2015.
[5] X. Wang, Y. Guo, C. Wu, Q. Fu, and Y. Yan, "A reverberation robust target speech detection method using dual-microphone in distant-talking scene," Speech Communication, vol. 72, pp. 47-58, 2015.
2014
[1] Chao Wu, Kaiyu Jiang, Xiaofei Wang, Yanmeng Guo, Qiang Fu, and Yonghong Yan, "A robust step-size control technique based on proportionate constraints on filter update for acoustic echo cancellation," Chinese Journal of Electronics, vol.?, 2014
[2] Chao Wu, Kaiyu Jiang, Yanmeng Guo, Qiang Fu, and Yonghong Yan, "A robust step-size control algorithm for frequency domain acoustic echo cancellation," presented at the InterSpeech, Singapore, 2014.
[3] Xiaofei Wang, Yanmeng Guo, Qiang Fu, and Yonghong Yan, "Reverberation robust two-microphone target signal detection algorithm with coherent interference," presented at the IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP), Xi'an, 2014.
[4] Xiaofei Wang, Yanmeng Guo, Xi Yang, Qiang Fu, and Yonghong Yan, "Acoustic Scene Aware Dereverberation using 2-channel spectral enhancement for REVERB Challenge," presented at the IEEE Workshop on REVERB Challenge, Florence, Italy, 2014.
[5] Xiaofei Wang, Yanmeng Guo, Qiang Fu, and Yonghong Yan, "Speech Enhancement Using Multi-channel Post-filtering with Modified Signal Presence Probability in Reverberant Environment," Chinese Journal of Electronics, vol. 23, pp. 598-604, 2014.
[6] Kaiyu Jiang, Chao Wu, Yanmeng Guo, Qiang Fu, and Yonghong Yan, "Acoustic echo control with frequency-domain stage-wise regression," IEEE Signal Processing Letters, vol. 21, pp. 1265-1269, 2014.
[7] Kaiyu Jiang, Yanmeng Guo, Qiang Fu, and Yonghong Yan, "Controlled cross spectrum whitening for coherence based two-microphone speech enhancement," presented at the 21st International Congress on Sound and Vibration (ICSV), Beijing, China, 2014.
2013
[1] Chao Wu, Qiang Fu, and Yonghong Yan, "A double-talk detection method based on noise estimation and energy ratio," presented at the National Conference on Man-Machine Speech Communication (NCMMSC), Guiyang, 2013.
[2] Xiaofei Wang, Kaiyu Jiang, Yanmeng Guo, Qiang Fu, and Yonghong Yan, "A dereverberation method based on spatial sound-field diffuseness information," Journal of Tsinghua University (Science and Technology), vol. 53, pp. 917-920, 2013.
2012
[1] Y. Guo, K. Li, Q. Fu, and Y. Yan, "A two-microphone based voice activity detection for distant-talking speech in wide range of direction of arrival," presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 2012.
[2] Yanmeng Guo, Kai Li, Qiang Fu, and Yonghong Yan, "Target speech detection based on microphone array using inter-channel phase differences," presented at the IEEE International Conference on Consumer Electronics(ICCE), Las Vegas, USA, 2012.
[3] Kai Li, Qiang Fu, and Yonghong Yan, "Speech enhancement using robust generalized sidelobe canceller with multi-channel post-filtering in adverse environments," Chinese Journal of Electronics, vol. 21, pp. 85-90, 2012.
[4] Kai Li, Yanmeng Guo, Qiang Fu, Junfeng Li, and Yonghong Yan, "Two-microphone noise reduction using spatial information-based spectral amplitude estimation," IEICE Transactions on Information and Systems, vol. E95-D, pp. 1454-1464, 2012.
[5] Kai Li, Yanmeng Guo, Qiang Fu, and Yonghong Yan, "A two microphone-based approach for speech enhancement in adverse environments," presented at the IEEE International Conference on Consumer Electronics(ICCE), Las Vegas, USA, 2012.
2011
[1] Kai Li, Qiang Fu, and Yonghong Yan, "Dual-channel optimally modified log-spectral amplitude estimator using spatial information," presented at the 4th International Congress on Image and Signal Processing, Shanghai, China, 2011.
[2] Kai Li, Qiang Fu, Junfeng Li, and Yonghong Yan, "Noise cross power spectral density estimation using spatial information controlled recursive averaging," presented at the Inter-Noise, Osaka, Japan, 2011.

Target audience: New researchers to the field, research students

Requirements on equipment, internet connection: speaker, laser pointer

[Back to Top]


ISCSLP-T3

Title: Techniques & Applications For Speech Interaction Between Human And Cloud Robot
Presenter: Min Chu, Zhijie Yan, Jian Sun, Yining Chen

Abstract: With the rapid progress of the Internet of Things and cloud computing, more and more data services have emerged over the internet. It has become a challenge for people to reach the right piece of information in a timely manner. Speech interaction turns out to be the most convenient modality in such a situation. It helps us get information from any device (with a cloud robot sitting behind it), anytime, anywhere, with high precision. In this tutorial, we will introduce the rich scenarios and challenges in the Alibaba ecosystem and the leading technologies we use to solve these real-world problems, ranging from automatic speech recognition (ASR) to natural language understanding (NLU), question answering (Q&A), and dialog management (DM).

The performance of ASR has advanced dramatically over the past several years. The main contributing factors include the data feedback loop, deep learning and increased computational power. Many real-world applications, such as transcription of phone conversations and multi-speaker discussions, which were previously thought very difficult, have become implementable. In this tutorial, we will briefly review some of the recent improvements of ASR systems, and focus on the technologies we have successfully deployed to enable our cloud robot effort to support new commercial products, services and business models.

In man-robot interaction scenarios, converting speech into text is far from enough. Natural language understanding and multi-turn dialog are very important capabilities for keeping the interaction running smoothly. Furthermore, when the domains or the format of the knowledge base differ, the way the dialog flow is organized should adapt accordingly. Many different technologies have to be integrated to achieve fluent interaction between the human and the cloud robot. In this tutorial, we will introduce the major approaches used to build natural language understanding, dialog management and question answering. Then we will focus on how we build the Natural User Interface (NUI) platform and how the NUI platform supports various applications, from customer service assistants to personal assistants.

Biography: Min Chu joined Alibaba in 2009. She has led various data-driven projects, including user profile/intention mining through multi-source data integration, web/mobile app analytics, machine translation, the ali-input method, etc. In 2014, she moved back to the speech interaction area and built up the Intelligent Speech Interaction team in Alibaba. The team builds core technologies on top of the cloud computing infrastructure and provides the speech interaction experience as services for Alibaba's business and for small businesses in its ecosystem. Before joining Alibaba, Min was a lead researcher at Microsoft Research Asia.

Zhijie Yan joined Alibaba in February 2015. He is now a director in the Intelligent Speech Interaction Group, Alibaba Cloud, focusing on developing core technologies for Alibaba speech products and services. Before joining Alibaba, he was a lead researcher at Microsoft Research Asia. His research interests include speech recognition, speech synthesis, speaker recognition and OCR/handwriting recognition. He has published many papers in related areas, and is a senior member of the IEEE.

Jian Sun received the Ph.D. degree in Signal and Information Processing from Beijing University of Posts & Telecommunications in 2002. From July 2002 to 2005, he was an assistant professor at the Institute of Computing, Chinese Academy of Sciences, China. Since May 2008, he has worked at Alibaba Group, and his current work focuses on intelligent speech interaction, especially spoken language understanding and dialog systems.

Yining Chen joined Alibaba in September 2009. He has led many big data projects since then. Recently, his focus has been on developing the core technologies and products for the question answering system in the Intelligent Speech Interaction Group, Alibaba Cloud. Before joining Alibaba, he was a researcher at Microsoft Research Asia. His areas of interest include speech recognition, speech synthesis, question answering, web search, vertical search, and recommendation systems.

[Back to Top]


ISCSLP-T4

Title: Deep Learning: Recent Advances and Moving Forward
Presenters: Dong Yu, Microsoft Research, dongyu@ieee.org

Tutorial description: Deep learning is a newly emerged area of research in machine learning. In recent years it has led to huge success in a variety of areas such as speech recognition, image classification, and natural language processing.
In this tutorial I will describe the core concepts and design principles in deep learning using some recent advances as examples. More specifically, I will discuss the non-linear functional combinatorial view, the feature engineering view, the end-to-end optimization view, and the dynamic system view of deep learning systems, and illustrate when deep learning models may be helpful, how conventional models can be integrated with deep learning models and how new deep learning models may be invented and designed for new problems. Recent models developed for speech recognition, speech separation, speaker recognition, image captioning, and reinforcement learning will be used as examples. I will also discuss some promising research directions in deep learning.

Significance and relevance: Deep learning will continue to be the driving force for many speech and NLP tasks.

Biography of presenter: Dr. Dong Yu is a principal researcher at Microsoft Research. His research focuses on speech recognition and applications of machine learning techniques. He has published two monographs and over 150 papers in these areas and is the inventor/co-inventor of 60 granted/pending patents. His recent work on the context-dependent deep neural network hidden Markov model (CD-DNN-HMM), which was recognized by the IEEE SPS 2013 Best Paper Award, caused a paradigm shift in large-vocabulary speech recognition.
Dr. Dong Yu is currently serving as a member of the IEEE Speech and Language Processing Technical Committee (2013-). He has served as an associate editor of the IEEE Transactions on Audio, Speech, and Language Processing (2011-2015), an associate editor of the IEEE Signal Processing Magazine (2008-2011), and the lead guest editor of the IEEE Transactions on Audio, Speech, and Language Processing special issue on deep learning for speech and language processing (2010-2011).

Relevant publications from presenter: See http://scholar.google.com/citations?user=tMY31_gAAAAJ&hl=en

Target audience: New researchers, students, and faculties

Requirements: No special requirements.

[Back to Top]


ISCSLP-T5

Title: Undirected Graphical Models: Theory and Applications to Speech and Language Processing
Presenters: Zhijian Ou, Department of Electronic Engineering, Tsinghua University, Beijing, China.

Tutorial description: Today, various tasks in speech and language processing generally involve statistical modeling, inference and learning. Probabilistic graphical models have emerged as a general framework for describing and applying statistical models, and they can be broadly classified into two classes [1]. In directed graphical models (DGMs, also known as Bayesian networks), the joint distribution is factorized into a product of local conditional probability functions, while in undirected graphical models (UGMs, also known as Markov random fields or Markov networks) the joint distribution is defined to be proportional to the product of local un-normalized potential functions.
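In symbols (a standard textbook formulation, e.g. [1], included here for reference), the two factorizations can be written as:

P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{pa}(x_i))    (directed: product of local conditionals given parents)

P(x_1, \dots, x_n) = \frac{1}{Z} \prod_{c} \psi_c(x_c),  with  Z = \sum_{x} \prod_{c} \psi_c(x_c)    (undirected: normalized product of clique potentials)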

In contrast to the dominant use of directed graphical models in speech and language processing, e.g. Hidden Markov Models (HMMs) and their extensions, and topic models and their extensions, the undirected modeling approach has been studied more in image modeling and became known to the speech and language processing community mainly through the introduction of Conditional Random Fields (CRFs) for sequence tagging [2]. Recently, various undirected models, e.g. Restricted Boltzmann Machines (RBMs) and Deep Boltzmann Machines (DBMs), along with mixed directed and undirected models, e.g. Deep Belief Networks (DBNs), have been shown to be important elements in the development and study of deep learning [3]. Moreover, some progress has been made on the difficult problem of learning random fields for sequential data, as demonstrated by its success in language modeling [4].

Roughly speaking, the main advantages of undirected modeling over directed modeling are: (1) Undirected modeling is more “natural” for certain domains, e.g. relational data, where being forced to choose a direction for the edges is awkward. (2) The greater flexibility of the undirected representation, which avoids local normalization and acyclicity requirements, can potentially give more powerful modeling capacity; eliminating these requirements allows us to encode a much richer set of patterns/features.

This course will introduce the general and basic concepts of undirected graphical models and demonstrate how to apply the theory to solve various problems in speech and language processing through a number of case studies.

The tutorial will introduce three major aspects of the theory of undirected graphical models.
(1) Semantics/Representation: What graphical models are. Both directed graphical models (also known as Bayesian networks) and undirected graphical models (also known as Markov random fields, Markov networks) will be introduced, so that the audience can better understand the difference and also the connection between these two modeling approaches.
(2) Inference: Exact algorithms (such as variable elimination and junction tree) will be briefly reviewed; we elaborate on the two main classes of approximate algorithms, which are based on variational and Monte Carlo principles, respectively.
(3) Learning: maximum likelihood learning from either complete data or incomplete data will be covered.

At the same time, we will present three case studies, all drawn from recent work in speech and language processing. We not only highlight the significance of the examples themselves but, more importantly, use them to illustrate the concepts and algorithms. (1) CRFs and their applications. We not only cover the classic application of CRFs to various sequence tagging problems in natural language processing (NLP), but also present a new application of CRFs to calculating the confidence measures of word candidates for lattice-based audio indexing [5].
(2) RBMs, DBNs, and DBMs. We not only detail the training algorithms but also connect those models to Neural Networks.
(3) Trans-dimensional random fields language models. We introduce a breakthrough in training random fields for sequential data [4].
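As a concrete reference for case study (1), the linear-chain CRF used for sequence tagging defines the conditional distribution of a tag sequence y given an observation sequence x in the standard form (following the notation of [2]):

p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big),  with  Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, x, t) \Big),

where the f_k are feature functions over adjacent tags and the observations, the \lambda_k are learned weights, and Z(x) is computed efficiently by the forward algorithm over the chain.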

The organization of the tutorial: 1. Semantics of DGMs and UGMs (10 min). Case study: Introducing CRFs, RBMs, DBNs, DBMs (20 min).
2. Exact inference - variable elimination and junction tree (20 min). Case study: Application to linear-chain CRFs.
3. Approximate inference – variational (10 min). Case study: Application to RBMs, DBNs, and DBMs (10 min).
4. Approximate inference – Monte Carlo (10 min).
5. Learning – Stochastic Approximation (SA), Stochastic Maximum Likelihood (SML), Contrastive Divergence (CD), Persistent Contrastive Divergence (PCD), Generalized Iterative Scaling (GIS), Improved Iterative Scaling (IIS) (20 min). Case study: Application to RBMs, DBNs, and DBMs (10 min). A minimal CD-1 code sketch is given after this outline.
6. Case study: Trans-dimensional random fields language models (10 min).
7. Summary and future research direction.
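As a concrete reference for item 5 above, the following is a minimal sketch of contrastive divergence (CD-1) training for a binary-binary RBM. The layer sizes, learning rate and single Gibbs step are illustrative assumptions, not a recommended configuration.

# Minimal CD-1 sketch for a binary-binary RBM (illustrative, not optimized).
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 784, 128, 0.01
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v = np.zeros(n_visible)            # visible biases
b_h = np.zeros(n_hidden)             # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    """One contrastive-divergence (CD-1) step on a mini-batch v0 (batch x visible)."""
    global W, b_v, b_h
    # Positive phase: hidden probabilities and samples given the data.
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to the visible layer (reconstruction).
    pv1 = sigmoid(h0 @ W.T + b_v)
    ph1 = sigmoid(pv1 @ W + b_h)
    # Gradient approximation: data statistics minus reconstruction statistics.
    batch = v0.shape[0]
    W   += lr * (v0.T @ ph0 - pv1.T @ ph1) / batch
    b_v += lr * (v0 - pv1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)

# Toy usage with random binary data in place of a real training set.
v_batch = (rng.random((32, n_visible)) < 0.5).astype(float)
cd1_update(v_batch)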

References: 1. D. Koller and N. Friedman. “Probabilistic graphical models: principles and techniques.” MIT press, 2009.
2. Eric Fosler-Lussier, et al. “Conditional random fields in speech, audio, and language processing.” Proceedings of the IEEE, 2013.
3. Yoshua Bengio, Aaron Courville, and Pierre Vincent. "Representation learning: A review and new perspectives." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
4. Bin Wang, Zhijian Ou and Zhiqiang Tan. “Trans-dimensional Random Fields for Language Modeling.” Annual Meeting of the Association for Computational Linguistics (Long Paper), Beijing, China, July 2015.
5. Zhijian Ou, Huaqing Luo. “CRF-based Confidence Measures of Recognized Candidates for Lattice-based Audio Indexing.” ICASSP, Kyoto, Japan, March 2012.

Novelty of the proposed tutorial: There are few tutorials dedicated to introducing undirected graphical models to the speech and language processing community. We systematically present the theory of undirected graphical models, including semantics, inference and learning. Moreover, the most recent developments and studies of UGMs are introduced.

Presentation format: On-site presentation.

Contact information: Email: ozj@tsinghua.edu.cn
Phone: 86-10-62796193 (o), 86-13661228007 (m)

Biography of presenter(s): Zhijian Ou received the B.S. degree with the highest honor in electronic engineering from Shanghai Jiao Tong University in 1998 and the Ph.D. degree in electronic engineering from Tsinghua University in 2003. Since 2003, he has been with the Department of Electronic Engineering of Tsinghua University and is currently an associate professor. From August 2014 to July 2015, he was a visiting scholar at the Beckman Institute, University of Illinois at Urbana-Champaign. He is a senior member of IEEE.
He has actively led research projects funded by the National Natural Science Foundation of China (NSFC), the China 863 High-tech Research and Development Program, and the China Ministry of Information Industry, as well as joint research projects with Intel, Panasonic, IBM, and Toshiba. He was a co-recipient of the Best Paper Award of the National Conference on Man-Machine Speech Communication in 2005. Since 2004, he has taught one of the earliest graduate courses in China dedicated to the theory and applications of graphical models. His recent research interests include speech processing (speech recognition and understanding, speaker recognition, natural language processing) and statistical machine intelligence (particularly with graphical models).

Relevant publications from presenter(s): 1. Haotian Xu, Zhijian Ou. Joint Stochastic Approximation Learning of Helmholtz Machines. International Conference on Learning Representations (ICLR) 2016 Workshop Track, Puerto Rico, USA, May 2016.
2. Bin Wang, Zhijian Ou and Zhiqiang Tan. Trans-dimensional Random Fields for Language Modeling. Annual Meeting of the Association for Computational Linguistics (Long Paper), Beijing, China, July 2015.
3. Zhijian Ou, Huaqing Luo. CRF-based Confidence Measures of Recognized Candidates for Lattice-based Audio Indexing. ICASSP, Kyoto, Japan, March 2012.
4. Nan Ding, Zhijian Ou. Variational nonparametric Bayesian hidden Markov model. ICASSP, Dallas, USA, March 2010.
5. Yimin Tan, Zhijian Ou. Topic-weak-correlated Latent Dirichlet Allocation. International Symposium on Chinese Spoken Language Processing (ISCSLP), Tainan, Taiwan, December 2010.
6. Hui Lin, Zhijian Ou. Switching Auxiliary Chains for Speech Recognition based on Dynamic Bayesian Networks. International Conference on Pattern Recognition (ICPR), Hong Kong, August 2006.

Target audience: The audience is expected to have a pre-existing working knowledge of probability, statistics, and algorithms. The tutorial would be helpful to anyone who is interested in addressing the problems in speech and language processing using UGMs.

Description of materials that are going to be distributed to participants: The slides of the tutorial.

Requirements: No special requirements.

[Back to Top]


ISCSLP-T6

Title: Emotion Recognition in Speech, Text and Conversational Data
Presenters: Junlan Feng, Chaomin Wang, Yanmeng Wang

Abstract: Emotion recognition has been widely studied over the last 20 years. Many systems have been built to identify the emotion of a spoken utterance, a document, a short text segment, etc. This tutorial systematically surveys the field, with an emphasis on the various deep learning based approaches recently published for this task. We will address three aspects. First, we will summarize the fundamentals of building an emotion recognition system for speech data and for text data, including system architecture, features, classification algorithms, available data, and remaining problems. Second, we will go through recent work on features and various deep learning-based approaches for detecting emotion in speech and text. Third, we will focus on describing state-of-the-art emotion recognition technologies for conversational data, which could be text or speech, and on how emotion recognition can assist dialog management and vice versa [1-11].

The traditional paradigm of emotion recognition for speech is to extract acoustic features such as MFCCs, the F0 contour and energy, and then train classifiers on these representations. The classification approaches include support vector machines, tree-based models, maximum entropy models, hidden Markov models and neural networks. Emotion in speech can further be divided into several dimensions, such as arousal, valence, power, and expectancy, and more complicated models have been proposed to predict multi-dimensional emotion states. Similarly, conventional approaches to text sentiment analysis are based on various textual features. To achieve good performance, they require comprehensive feature engineering such as n-grams, headwords, part-of-speech tags, parsing-based features, and so on. The classification approaches are similar to those applied in speech emotion recognition. In this tutorial we will systematically summarize these prior efforts.
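As a rough illustration of this classical pipeline, the sketch below computes utterance-level statistics of MFCCs and F0 and trains an SVM on them. It is a toy sketch only: librosa and scikit-learn are used merely as convenient tools, the random waveforms and labels stand in for a real emotion corpus, and the feature set is far smaller than in real systems.

# Classical speech emotion recognition sketch:
# frame-level acoustic features -> utterance-level statistics -> SVM classifier.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def utterance_features(y, sr=16000):
    """Utterance-level statistics of frame-level acoustic features."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # (13, frames)
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)           # coarse F0 contour
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           [np.nanmean(f0), np.nanstd(f0)]])

# Stand-in "corpus": random signals with made-up labels; a real system would
# load labelled utterances from an emotion corpus instead.
rng = np.random.default_rng(0)
waves = [rng.standard_normal(16000) for _ in range(4)]       # four 1-second clips
labels = ["angry", "neutral", "angry", "neutral"]

X = np.vstack([utterance_features(y) for y in waves])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, labels)
print(clf.predict(X[:1]))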

As deep learning is being explored for most typical problems in speech and natural language understanding, it also inspires researchers to re-think the way we do emotion recognition. Various methods in this line have recently been reported [5][6][11]. For instance, deep neural networks have been used as classifiers to classify emotions. An emerging trend in text sentiment analysis is the introduction of word vector representations and the composition of text over the learned word vectors for classification. The learned word vectors interact via CNN/RNN/LSTM models to compose the meaning of phrases, sentences, and documents, forming the vector representation of the text for subsequent sentiment classification.
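The sketch below shows the kind of CNN-over-word-vectors composition referenced above and in [11]. The vocabulary size, embedding dimension, filter settings and random (rather than pre-trained) embeddings are illustrative assumptions.

# Minimal Kim-style text CNN sketch for sentence-level sentiment classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, n_classes=2,
                 n_filters=100, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        x = self.emb(tokens).transpose(1, 2)        # (batch, emb_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))    # class logits

model = TextCNN()
tokens = torch.randint(0, 10000, (8, 40))            # a batch of token-id sequences
logits = model(tokens)                                # (8, 2) sentiment logits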

Recently, increasing attention has been directed to emotion recognition on conversational data. Commercial chatting systems and customer care systems in particular rely on emotion detection to assist dialog management. In this tutorial, we will go through these advances published in the literature and the work deployed in industry.

This tutorial is designed for researchers and students who are new to this field, and also for experts who want an overview of recent state-of-the-art technologies.

References: [1] M. El Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion recognition: Features, classification schemes, and databases,” Pattern Recognition, vol. 44, no. 3, pp. 572–587, 2011.
[2] B. Schuller, S. Steidl, A. Batliner, E. Nöth, A. Vinciarelli, F. Burkhardt, R. van Son, F. Weninger, F. Eyben, T. Bocklet et al., “A survey on perceived speaker traits: Personality, likability, pathology, and the first challenge,” Computer Speech & Language, vol. 29, no. 1, pp. 100–131, 2015.
[3] Y. Chavhan, M. Dhore, and P. Yesaware, “Speech emotion recognition using support vector machine,” International Journal of Computer Applications, vol. 1, no. 20, pp. 6–9, 2010.
[4] Y. Pan, P. Shen, and L. Shen, “Speech emotion recognition using support vector machine,” International Journal of Smart Home, vol. 6, no. 2, pp. 101–108, 2012.
[5] A. Stuhlsatz, C. Meyer, F. Eyben, T. Zielke, G. Meier, and B. Schuller, “Deep neural networks for acoustic emotion recognition: raising the benchmarks,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 5688–5691.
[6] K. Han, D. Yu, and I. Tashev, “Speech emotion recognition using deep neural network and extreme learning machine.” in Interspeech, 2014, pp. 223–227.
[7] B. Schuller, S. Steidl, and A. Batliner, “The interspeech 2009 emotion challenge.” in INTERSPEECH, vol. 2009. Citeseer, 2009, pp. 312–315.
[8] C. Vaudable and L. Devillers, “Negative emotions detection as an indicator of dialogs quality in call centers,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 5109–5112.
[9] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, no. 1, pp. 489–501, 2006.
[10] C.-W. Hsu, C.-C. Chang, C.-J. Lin et al., “A practical guide to support vector classification,” 2003.
[11] Y. Kim, “Convolutional neural networks for sentence classification,” arXiv preprint arXiv:1408.5882, 2014

Biography: Junlan Feng, Director of the China Mobile Big Data and IT Technology Research Institute (September 2013 to present). Previously she was an architect for IBM big data products (January to August 2013) and a Principal Researcher at AT&T Labs Research (August 2001 to January 2013). In 2013, Dr. Feng was recruited as a China 1000 Talent Plan Expert. Dr. Feng received her Ph.D. in 2001 from the Chinese Academy of Sciences. She is an IEEE senior member and has served on the IEEE Speech and Language Processing Technical Committee and the IEEE Industry Committee. She is a reviewer for major international speech and natural language conferences and journals, and has chaired and organized multiple conferences in these fields. Dr. Feng has over 50 publications and has been granted 31 U.S. and international patents.

Yanmeng Wang, researcher at the China Mobile Big Data and IT Technology Research Institute (September 2015 to present). He previously worked at the Telecom Beijing Research Institute (February 2009 to August 2015) and Telenor Denmark (July 2007 to January 2009). He received his Master's degree in 2007 from the Technical University of Denmark. He works on auto-QA system development and call center data analysis. His research areas include natural language processing and deep learning.

Chaomin Wang, researcher at the China Mobile Big Data and IT Technology Research Institute (September 2012 to present). He is currently a Ph.D. candidate at the Beijing Institute of Technology, China, and received his Master's degree in 2009 from the Beijing Institute of Technology. He works on the intelligent customer care system and call center data analysis. His research areas include speech processing and deep learning.

[Back to Top]


ISCSLP-T7

Title: Automatic Speaker Verification: State of the Art, Spoofing and Countermeasures
Presenters: Zhizheng Wu, Haizhou Li

Abstract: Automatic speaker verification (ASV) offers a flexible biometric solution to person authentication. While the reliability of ASV systems is considered sufficient for mass market adoption, there are concerns about their vulnerability to spoofing, which refers to an attack whereby a fraudster attempts to manipulate an ASV system by masquerading as another, enrolled person. Due to the availability of high-quality, low-cost recording devices such as smartphones, replay spoofing attacks are arguably the most accessible and therefore present a significant threat; similarly, speaker adaptation in speech synthesis and voice conversion techniques can mimic a target speaker's voice automatically, and hence present a genuine threat to ASV systems.

The research community has responded to replay, speech synthesis and voice conversion spoofing attacks with dedicated countermeasures which aim to detect and deflect such attacks. Even though the literature shows that they can be effective, the problem is far from solved; ASV systems remain vulnerable to spoofing, and a deeper understanding of speaker verification, speech synthesis and voice conversion will be fundamental to the pursuit of spoofing-robust speaker verification. A tutorial on state-of-the-art speaker verification, spoofing and countermeasure techniques is therefore much needed.

The tutorial will focus on voice conversion spoofing and countermeasures, which attract the most attention in the community. It will also detail the speakers' experience and the lessons learned from organizing the ASVspoof 2015 challenge.

Biography: Zhizheng Wu (University of Edinburgh, UK, zhizheng.wu@ed.ac.uk) received his Ph.D. from Nanyang Technological University, Singapore. He was a visiting researcher at Microsoft Research Asia and the University of Eastern Finland. Since 2014, he has been a research fellow in the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. He received the best paper award at the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2012, and delivered a tutorial on “Spoofing and Anti-Spoofing: A Shared View of Speaker Verification, Speech Synthesis and Voice Conversion” at APSIPA ASC 2015. He co-organised the first Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015) at Interspeech 2015 and the first Voice Conversion Challenge (VCC 2016).

Haizhou Li received the B.Sc., M.Sc., and Ph.D. degrees in electrical and electronic engineering from South China University of Technology, Guangzhou, China in 1984, 1987, and 1990, respectively. Dr. Li is currently Principal Scientist and Department Head of Human Language Technology at the Institute for Infocomm Research (I2R), Singapore. He is also an adjunct Professor at Nanyang Technological University, the National University of Singapore and the University of New South Wales, Australia. His research interests include automatic speech recognition, speaker and language recognition, natural language processing, and computational intelligence. Prior to joining I2R, he taught at the University of Hong Kong (1988-1990) and South China University of Technology (1990-1994). He was a Visiting Professor at CRIN in France (1994-1995), a Research Manager at the Apple-ISS Research Centre (1996-1998), a Research Director at Lernout & Hauspie Asia Pacific (1999-2001), and Vice President at InfoTalk Corp. Ltd. (2001-2003). Dr. Li is currently the Editor-in-Chief of the IEEE/ACM Transactions on Audio, Speech and Language Processing (2015-2017), a Member of the Editorial Board of Computer Speech and Language (2012-2016), the President of the International Speech Communication Association (2015-2017), and the President of the Asia-Pacific Signal and Information Processing Association (2015-2016). He was an elected Member of the IEEE Speech and Language Processing Technical Committee (2013-2015), and the General Chair of ACL 2012 and INTERSPEECH 2014. Dr. Li is a Fellow of the IEEE. He was a recipient of the National Infocomm Award 2002 and the President's Technology Award 2013 in Singapore. He was named one of the two Nokia Visiting Professors in 2009 by the Nokia Foundation.

[Back to Top]

