From c1a0e3068efc53c4816712dc0a8236428bce02d0 Mon Sep 17 00:00:00 2001 From: Roberto Cai Date: Fri, 15 Mar 2019 15:04:09 +0100 Subject: [PATCH] Add first draft of speech tutorial --- domestic_robotics/speech_recognition.ipynb | 325 +++++++++++++++++++++ 1 file changed, 325 insertions(+) create mode 100644 domestic_robotics/speech_recognition.ipynb diff --git a/domestic_robotics/speech_recognition.ipynb b/domestic_robotics/speech_recognition.ipynb new file mode 100644 index 0000000..c3c32e3 --- /dev/null +++ b/domestic_robotics/speech_recognition.ipynb @@ -0,0 +1,325 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython import display" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Speech recognition tutorial\n", + "\n", + "Our speech module is much smaller than other components such as perception. However, it plays an important role in enabling human-machine interaction.\n", + "\n", + "This tutorial focuses on the speech recognition implementation currently used by the @Home team, which relies on [SpeechRecognition](https://github.com/Uberi/speech_recognition), a Python library that supports several online and offline speech recognition engines and APIs. It will also cover __Kaldi__ and how it was integrated into the speech recognition pipeline.\n", + "\n", + "This tutorial depends on the following:\n", + "\n", + "- [SpeechRecognition](https://github.com/Uberi/speech_recognition)\n", + "    - CMU's [PocketSphinx](https://pypi.org/project/pocketsphinx/) for offline speech recognition (optional, in case you do not want to use Google Speech)\n", + "- [Kaldi](https://github.com/kaldi-asr/kaldi) - speech recognition toolkit\n", + "- [py-kaldi-asr](https://github.com/gooofy/py-kaldi-asr) - a Python wrapper for Kaldi\n", + "\n", + "The basic idea of speech recognition is to record speech in the form of a sound wave and convert it into a digital representation of the wave. Once we have this audio data, we can use it as input to a model that transcribes the audio into text. I will not delve into the types of models used, since that is not the focus of this tutorial, but keep in mind that there are several existing APIs built on state-of-the-art methods (Google for online and Kaldi for offline recognition in this case).
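\n\nTo give a concrete sense of this pipeline before we set anything up, the sketch below transcribes a short pre-recorded WAV file offline with PocketSphinx through the SpeechRecognition library. The file name ``test.wav`` is a placeholder, and the required packages are installed in the next section:\n\n```python\nimport speech_recognition as sr\n\nrec = sr.Recognizer()\n\n# Read a pre-recorded utterance into an AudioData object.\nwith sr.AudioFile('test.wav') as source:\n    audio = rec.record(source)\n\n# Transcribe it offline with PocketSphinx\n# (or use rec.recognize_google(audio) when online).\nprint(rec.recognize_sphinx(audio))\n```\n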
\n", + "\n", + "## Contents\n", + "\n", + "- [Installation](#Installation)\n", + "    - [Requirements](#Requirements)\n", + "    - [SpeechRecognition](#SpeechRecognition)\n", + "    - [Kaldi](#Kaldi)\n", + "    - [py-kaldi-asr](#py-kaldi-asr)\n", + "    - [Kaldi pre-trained models](#Kaldi-pre-trained-models)\n", + "- [Python SpeechRecognition](#Python-SpeechRecognition)\n", + "    - [Recognizer Class](#Recognizer-Class)\n", + "    - [Working with a microphone](#Working-with-a-microphone)\n", + "    - [Working Example](#Working-Example)\n", + "- [mdr_speech_recognition](#mdr_speech_recognition)\n", + "\n", + "## Installation\n", + "\n", + "### Requirements\n", + "\n", + "* **Python** 2.7 or 3.3+ (required)\n", + "* **PyAudio** 0.2.11+ (required only if you need to use microphone input, ``Microphone``)\n", + "* **PocketSphinx** (required only if you need to use the Sphinx recognizer, ``recognizer_instance.recognize_sphinx``)\n", + "* **Google API Client Library for Python** (required only if you need to use the Google Cloud Speech API, ``recognizer_instance.recognize_google_cloud``)\n", + "* **FLAC encoder** (required only if the system is not x86-based Windows/Linux/OS X)\n", + "* **wget** for additional non-Kaldi packages.\n", + "* **Standard UNIX utilities**: bash, perl, awk, grep, and make.\n", + "* A linear-algebra package such as **ATLAS**, **CLAPACK**, or **OpenBLAS**.\n", + "* **Cython**\n", + "\n", + "\n", + "### SpeechRecognition\n", + "\n", + "Installing the SpeechRecognition package is as easy as typing:\n", + "\n", + "``pip install SpeechRecognition``\n", + "\n", + "However, this package currently does not support Kaldi. For the purpose of this tutorial, install it from here instead:\n", + "\n", + "``git clone -b feature/py-kaldi-asr_support https://github.com/robertocaiwu/speech_recognition.git``\n", + "\n", + "``cd speech_recognition && python setup.py install --user``\n", + "\n", + "\n", + "### Kaldi\n", + "\n", + "1. ``git clone https://github.com/kaldi-asr/kaldi.git``\n", + "2. Navigate into the kaldi/tools folder and run ``./extras/check_dependencies.sh`` to check for additional dependencies that are needed for installation.\n", + "3. After installing the necessary dependencies, compile by running ``make -j <num-cores>``\n", + "4. Navigate into the kaldi/src folder and run ``./configure --shared``\n", + "5. ``make depend -j <num-cores>``\n", + "6. ``make check -j <num-cores>``\n", + "7. ``make -j <num-cores>``\n", + "\n", + "### py-kaldi-asr\n", + "\n", + "``pip install py-kaldi-asr``\n", + "\n", + "### Kaldi pre-trained models\n", + "\n", + "You can find pre-trained models for various languages on the official [Kaldi documentation](https://kaldi-asr.org/models.html) page and at [zamia-speech](https://goofy.zamia.org/zamia-speech/asr-models/).\n", + "\n", + "The specific one used in this tutorial is a model trained on 1200 hours of English audio, which can be downloaded from [here](https://goofy.zamia.org/zamia-speech/asr-models/kaldi-generic-en-tdnn_f-r20190227.tar.xz). It has decent background noise resistance and can also be used on phone recordings.
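\n\nAfter extracting the archive, you can run a quick sanity check that Kaldi, py-kaldi-asr, and the downloaded model work together by decoding a short 16 kHz WAV file, roughly following the py-kaldi-asr README. The paths below are placeholders you will need to adapt, and the exact API may vary between py-kaldi-asr versions:\n\n```python\nfrom kaldiasr.nnet3 import KaldiNNet3OnlineModel, KaldiNNet3OnlineDecoder\n\n# Placeholder paths: point these at the extracted model directory\n# and at any short 16 kHz mono WAV recording.\nMODEL_DIR = 'data/models/kaldi-generic-en-tdnn_f-r20190227'\nWAV_FILE = 'data/hello.wav'\n\nmodel = KaldiNNet3OnlineModel(MODEL_DIR)\ndecoder = KaldiNNet3OnlineDecoder(model)\n\nif decoder.decode_wav_file(WAV_FILE):\n    text, likelihood = decoder.get_decoded_string()\n    print('decoded: %s (likelihood: %f)' % (text, likelihood))\n```\n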
\n", + "\n", + "\n", + "## Python SpeechRecognition\n", + "\n", + "### Recognizer Class\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import speech_recognition as sr\n", "\n", "rec = sr.Recognizer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each ``Recognizer`` instance represents a collection of speech recognition functionality. The class supports several APIs for recognizing speech:\n", "\n", "* ``recognize_sphinx()`` [CMU Sphinx](http://cmusphinx.sourceforge.net/wiki/) (works offline)\n", "* ``recognize_google()`` Google Speech Recognition\n", "* ``recognize_google_cloud()`` [Google Cloud Speech API](https://cloud.google.com/speech/)\n", "* ``recognize_wit()`` [Wit.ai](https://wit.ai)\n", "* ``recognize_bing()`` [Microsoft Bing Voice Recognition](https://www.microsoft.com/cognitive-services/en-us/speech-api)\n", "* ``recognize_houndify()`` [Houndify API](https://houndify.com/)\n", "* ``recognize_ibm()`` [IBM Speech to Text](http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/speech-to-text.html)\n", "\n", "Out of the box, we can only use Google Speech Recognition. For the rest (with the exception of PocketSphinx, which runs offline), we need an API key or a username/password combination to use the corresponding online service.\n", "\n", "
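If you do sign up for one of these services, the credentials are passed straight to the corresponding ``recognize_*()`` call as keyword arguments. A short sketch (the key is a placeholder, and ``audio`` stands for an ``AudioData`` object such as the ones recorded later in this tutorial):\n\n```python\ntry:\n    # Each online engine takes its own credential arguments;\n    # recognize_google(), for example, accepts an API key.\n    text = rec.recognize_google(audio, key='YOUR_GOOGLE_API_KEY')\nexcept sr.UnknownValueError:\n    print('Could not understand the audio')\nexcept sr.RequestError as error:\n    print('API request failed:', error)\n```\n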
\n", + "**Caution:** The default key provided for Google Speech Recognition is for testing purposes only, and Google may revoke it at any time. It is not a good idea to use the Google Web Speech API in production. Even with a valid API key, you’ll be limited to only 50 requests per day, and there is no way to raise this quota. Fortunately, SpeechRecognition’s interface is nearly identical for each API, so what you learn today will be easy to translate to a real-world project. [source](https://realpython.com/python-speech-recognition/#the-recognizer-class)\n", + "
\n", + "\n", + "However, if you installed the suggested package for this tutorial, the Recognizer class will also include:\n", + "\n", + "* ``recognize_kaldi()`` [Kaldi](https://github.com/kaldi-asr/kaldi)\n", + "\n", + "### Working with a microphone" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "mic = sr.Microphone()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A ``Microphone`` instance represents a physical microphone on the computer. The class has three parameters:\n", "\n", "- device_index\n", "- sample_rate\n", "- chunk_size\n", "\n", "If ``device_index`` is unspecified or ``None``, the default microphone is used as the audio source. Otherwise, ``device_index`` should be the index of the device to use for audio input.\n", "\n", "The microphone audio is recorded in chunks of ``chunk_size`` samples, at a rate of ``sample_rate`` samples per second (Hertz). If not specified, the value of ``sample_rate`` is determined automatically from the system's microphone settings.\n", "\n", "Higher ``sample_rate`` values result in better audio quality, but also more bandwidth (and therefore, slower recognition). Additionally, some CPUs, such as those in older Raspberry Pi models, can't keep up if this value is too high.\n", "\n", "Higher ``chunk_size`` values help avoid triggering on rapidly changing ambient noise, but also make detection less sensitive. This value generally should be left at its default.\n", "\n", "It is recommended to use a sample rate of 16,000 Hertz, since the vast majority of models are trained on audio data with this same sample rate.\n", "\n", "### Working Example" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loading model...\n", "loading model... done.\n" ] } ], "source": [ "model_directory = '/home/rob/speech/data/models/tdnn-en'\n", "\n", "mic = sr.Microphone()\n", "rec = sr.Recognizer()\n", "rec.load_kaldi_model(model_directory=model_directory)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Listening ...\n", "You said: hello hello \n" ] } ], "source": [ "# Start recording\n", "with mic as source:\n", "    rec.adjust_for_ambient_noise(source)\n", "    print('Listening ...')\n", "    audio = rec.listen(source)\n", "\n", "    # Decode the recorded audio with Kaldi\n", "    s = rec.recognize_kaldi(audio)\n", "    print(\"You said: \", s[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## mdr_speech_recognition\n", "\n", "Currently, the mdr_speech_recognition package only uses PocketSphinx and the default Google Speech API; integrating the Kaldi wrapper is work in progress.\n", "\n", "There is a reason for using both engines. By default we want to use the Google Speech API, as it provides the best recognition accuracy. However, this requires the robot to maintain a stable internet connection at all times. When connecting to the internet is not possible, we need a fallback that can still recognize speech, which in this case is PocketSphinx.\n", "\n", "This package is completely robot-independent.
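\n\nThe node's main loop, shown in the next cell, decides between the two engines with a small connectivity check, ``SpeechRecognizer.check_internet_connection()``, which is not shown there. A minimal sketch of such a helper (hypothetical; the actual implementation may differ) could look like this:\n\n```python\nimport socket\n\ndef check_internet_connection(host='8.8.8.8', port=53, timeout=3.0):\n    # Try to reach a well-known DNS server; if the connection fails,\n    # assume we are offline and fall back to PocketSphinx.\n    try:\n        socket.create_connection((host, port), timeout=timeout).close()\n        return True\n    except OSError:\n        return False\n```\n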
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def recognize(self):\n", + " with self.microphone as source:\n", + " self.recognizer.adjust_for_ambient_noise(source)\n", + "\n", + " try:\n", + " while not rospy.is_shutdown():\n", + " rospy.loginfo('Listening...')\n", + " with self.microphone as source:\n", + " audio = self.recognizer.listen(source)\n", + " rospy.loginfo('Got a sound; recognizing...')\n", + "\n", + " \"\"\"\n", + " Google over PocketSphinx: In case there is a internet connection\n", + " use google, otherwise use pocketsphinx for speech recognition.\n", + " \"\"\"\n", + " recognized_speech = \"\"\n", + " if SpeechRecognizer.check_internet_connection():\n", + " try:\n", + " recognized_speech = self.recognizer.recognize_google(audio)\n", + " except sr.UnknownValueError:\n", + " rospy.logerr(\"Could not understand audio.\")\n", + " except sr.RequestError:\n", + " rospy.logerr(\"Could not request results.\")\n", + " else:\n", + " try:\n", + " recognized_speech = self.recognizer.recognize_sphinx(audio)\n", + " except sr.UnknownValueError:\n", + " rospy.logerr(\"Could not understand audio.\")\n", + " except sr.RequestError:\n", + " rospy.logerr(\"Could not request results.\")\n", + "\n", + " if recognized_speech != \"\":\n", + " rospy.loginfo(\"You said: \" + recognized_speech)\n", + " self.pub.publish(recognized_speech)\n", + "\n", + " except Exception as exc:\n", + " rospy.logerr(exc)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The mdr_speech_recognition package is completely robot independent. You can run it by building the corresponding package \n", + "\n", + "``catkin build mdr_speech_recognition``\n", + "\n", + "and then launching:\n", + "\n", + "``roslaunch mdr_speech_recognition speech_recognition.launch ``\n", + "\n", + "If you listen to the **/speech_recognizer** topic, you will get the text output of the speech recognition API." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "speech", + "language": "python", + "name": "speech" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}