2. Offline Text-to-Speech (TTS)


1. Concept Introduction

1.1 What is "TTS"?

Text-to-speech (TTS) technology converts written text into audible, human-like speech. It lets a computer "read" text aloud and is widely used in accessible reading, intelligent assistants, navigation systems, and educational software. With TTS, users can listen to natural, fluent machine-generated speech instead of reading a screen, which makes information easier to access.

1.2 Brief Overview of Implementation Principles

The implementation of a TTS system primarily involves the following key steps and technologies:

1. Text Analysis

2. Language Processing

3. Speech Synthesis

4. Sound Waveform Generation

With advances in artificial intelligence and machine learning, especially deep learning, TTS systems have become more accurate and have also improved markedly in naturalness and emotional expression, so machine-generated speech sounds increasingly like a human voice.

2. Code Analysis

Key Code

1. TTS Initialization and Calling (largemodel/largemodel/model_service.py)

2. TTS backend implementation (largemodel/utils/large_model_interface.py)

Code Analysis

The text-to-speech (TTS) function is invoked by the LargeModelService node and implemented by the model_interface class. Its design uses parameter configuration to switch between different backend services.

  1. Initialization Process (model_service.py):

    • During LargeModelService initialization, the init_param_config function reads the Boolean value useolinetts from the ROS parameter server.
    • Based on the value of useolinetts, the system_sound_init function passes either the 'local' or 'oline' string to the self.model_client.tts_model_init method.
    • In large_model_interface.py, the tts_model_init method executes the corresponding initialization logic based on the string parameter received. If the value is 'local', piper.PiperVoice.load is used to load the local model file.
  2. Synthesis and Playback Process (model_service.py):

    • When voice playback is required, the _safe_play_audio function is called.
    • This function first calls the self.model_client.voice_synthesis method, passing in the text to be converted and the target audio path self.tts_out_path.
    • After the voice_synthesis method completes and generates an audio file, _safe_play_audio calls self.play_audio_async to asynchronously play the file.
  3. Backend Implementation Selection (large_model_interface.py):

    • The voice_synthesis method is the backend dispatch center for the TTS functionality. Internally, it selects the execution path by checking the self.model_type property value set during initialization.
    • If self.model_type is 'local', the code block uses Python's wave library to open a WAV file, sets its header parameters (channels, sample bit width, and sampling rate), and then calls self.synthesizer.synthesize to write the synthesized text audio stream directly to the file.
    • If self.model_type is 'oline', the code takes the branch for the configured cloud service provider (such as Baidu or Alibaba Cloud).
    • This structure separates the upper-level node call ("Speak this sentence") from the lower-level specific synthesis technology (which library to use, which API to call).
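The flow described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: ModelInterfaceSketch, FakeSynthesizer, safe_play_audio, and the 22050 Hz sample rate are assumptions made for the example. The fake synthesizer writes a short tone per character, so the sketch runs without Piper installed; in the real code the 'local' branch loads a voice with piper.PiperVoice.load.

```python
import math
import struct
import tempfile
import threading
import wave

SAMPLE_RATE = 22050  # assumed value; the real rate depends on the voice model


class FakeSynthesizer:
    """Hypothetical stand-in for the loaded Piper voice (illustration only)."""

    def synthesize(self, text, wav_file):
        # Write 1/20 s of a 440 Hz tone per character instead of real speech.
        rate = wav_file.getframerate()
        frames = bytearray()
        for _ in text:
            for i in range(rate // 20):
                sample = 0.2 * math.sin(2 * math.pi * 440 * i / rate)
                frames += struct.pack("<h", int(sample * 32767))
        wav_file.writeframes(bytes(frames))


class ModelInterfaceSketch:
    """Hypothetical, reduced version of the backend dispatch."""

    def __init__(self):
        self.model_type = None
        self.synthesizer = None

    def tts_model_init(self, model_type):
        if model_type == "local":
            # The real code loads the local model file here.
            self.synthesizer = FakeSynthesizer()
        elif model_type == "oline":
            self.synthesizer = None  # cloud client setup omitted in this sketch
        else:
            raise ValueError(f"unknown TTS mode: {model_type!r}")
        self.model_type = model_type

    def voice_synthesis(self, text, out_path):
        if self.model_type == "local":
            with wave.open(out_path, "wb") as wav_file:
                wav_file.setnchannels(1)           # mono
                wav_file.setsampwidth(2)           # 16-bit samples
                wav_file.setframerate(SAMPLE_RATE)
                self.synthesizer.synthesize(text, wav_file)
            return out_path
        if self.model_type == "oline":
            raise NotImplementedError("cloud branch omitted in this sketch")
        raise RuntimeError("tts_model_init has not been called")


def safe_play_audio(client, text, out_path, play):
    """Synthesize first, then hand the finished file to a background player."""
    path = client.voice_synthesis(text, out_path)
    worker = threading.Thread(target=play, args=(path,), daemon=True)
    worker.start()
    return worker


if __name__ == "__main__":
    use_online_tts = False  # mirrors the useolinetts parameter
    client = ModelInterfaceSketch()
    client.tts_model_init("oline" if use_online_tts else "local")
    with tempfile.TemporaryDirectory() as tmp_dir:
        worker = safe_play_audio(client, "hello", f"{tmp_dir}/tts_out.wav", print)
        worker.join()
```

The caller only passes text and an output path. Which engine produces the audio is decided once, in tts_model_init, which is the separation the last bullet describes.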

3. Practical Operations

3.1 Configuring Offline TTS

To enable offline TTS, you need to correctly configure yahboom.yaml and large_model_interface.yaml and ensure that the local model is correctly placed.

First, enter the Docker container by running ./ros2_docker.sh in the host terminal.

To open another shell in the same Docker container later, run ./ros2_docker.sh again in the host terminal.

  1. Open the main configuration file:

  2. Modify or confirm the following key settings:

    useolinetts: set this to False to use the local model.

    Language setting: select "zh" for Chinese or "en" for English.

  3. Open the model interface configuration file:

  4. Confirm the offline model path:
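Before starting the node, the configured model path can be sanity-checked with a short script. This is a sketch under assumptions: check_piper_model is a hypothetical helper, and it assumes the usual Piper layout, in which a voice is an .onnx file with a JSON config stored next to it under the same name plus ".json". If this project passes the config path explicitly, adjust the check accordingly.

```python
from pathlib import Path


def check_piper_model(model_path):
    """Return a list of problems; an empty list means the files look usable."""
    problems = []
    model = Path(model_path).expanduser()
    # Piper's default config location is the model path with ".json" appended.
    config = Path(str(model) + ".json")
    if model.suffix != ".onnx":
        problems.append(f"expected an .onnx file, got: {model.name}")
    if not model.is_file():
        problems.append(f"model file not found: {model}")
    if not config.is_file():
        problems.append(f"config file not found: {config}")
    return problems
```

Call it with the path taken from the configuration file. An empty result means both files exist; otherwise each entry names what is missing.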

3.2 Starting and Testing the Functionality

  1. Start the TTS node: Run the following command:

  2. Send the text to be synthesized: Open a new terminal and run the following command to publish a voice message:

  3. Test: If everything works correctly, you should hear the robot say "Speech Synthesis Test Successful" in a synthesized voice through your speakers.


4. Common Problems and Solutions

4.1 Playback Issues

Issue 1: The program runs normally without errors, but no sound is heard.

Solution:

  1. Check Audio Output: Confirm that your system's audio output device is selected correctly and the volume is not muted. Try playing a standard music file to test the hardware.
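If no music file is at hand, a known-good test file can be generated with Python's standard library. The helper below is a hypothetical sketch (write_test_tone and its default values are assumptions for illustration). It writes a one-second 440 Hz tone as a 16-bit mono WAV file, the same container format the local TTS branch produces.

```python
import math
import struct
import wave


def write_test_tone(path, freq_hz=440.0, seconds=1.0, rate=16000, volume=0.3):
    """Write a sine tone to a 16-bit mono WAV file and return the frame count."""
    n_frames = int(rate * seconds)
    frames = bytearray()
    for i in range(n_frames):
        sample = volume * math.sin(2 * math.pi * freq_hz * i / rate)
        frames += struct.pack("<h", int(sample * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16-bit samples
        wav_file.setframerate(rate)
        wav_file.writeframes(bytes(frames))
    return n_frames
```

After calling write_test_tone("test_tone.wav"), play the file with any audio player, for example aplay test_tone.wav on a Linux system with ALSA utilities installed. If the tone is audible but synthesized speech is not, the hardware is working and the cause is more likely in the playback step of the node.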