2. Offline Text-to-Speech (TTS)

1. Concept Introduction
  1.1 What is "TTS"?
  1.2 Brief Overview of Implementation Principles
    1. Text Analysis
    2. Language Processing
    3. Speech Synthesis
    4. Sound Waveform Generation
2. Code Analysis
  Key Code
    1. TTS Initialization and Calling (largemodel/largemodel/model_service.py)
    2. TTS backend implementation (largemodel/utils/large_model_interface.py)
  Code Analysis
3. Practical Operations
  3.1 Configuring Offline TTS
  3.2 Starting and Testing the Functionality
4. Common Problems and Solutions
  4.1 Playback Issues
    Issue 1: The program runs normally without errors, but no sound is heard.
1. Concept Introduction

1.1 What is "TTS"?

TTS technology converts written text into audible speech. It enables computers to "read" text aloud and is widely used in accessible reading, intelligent assistants, navigation systems, and educational software. With TTS, users hear natural, fluent machine-generated speech, which makes accessing information more convenient and flexible.
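1.2 Brief Overview of Implementation Principles

The implementation of a TTS system primarily involves the following key steps:

1. Text Analysis
2. Language Processing
3. Speech Synthesis
4. Sound Waveform Generation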
With advances in artificial intelligence and machine learning, and deep learning in particular, TTS systems have become more accurate and considerably more natural and expressive, so machine-generated speech increasingly resembles a human voice.
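Text analysis, typically the first stage of a TTS pipeline, often includes normalization: rewriting digits and abbreviations into words a synthesizer can pronounce. The following is a minimal sketch of that idea only, not how Piper or this project processes text. The names `normalize_text` and `spell_digits` and the small abbreviation table are hypothetical.

```python
import re

# Hypothetical lookup tables for illustration; real systems use far larger rule sets.
ABBREVIATIONS = {"Dr.": "Doctor", "km": "kilometers"}
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_digits(match):
    """Spell a matched number digit by digit, e.g. '42' -> 'four two'."""
    return " ".join(DIGIT_WORDS[int(d)] for d in match.group())


def normalize_text(text):
    """Expand abbreviations and digits, then collapse extra whitespace."""
    # Naive substring replacement: good enough for a demo, too blunt for real text.
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"\d+", spell_digits, text)
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("Dr. Lee walked 12 km"))
# prints: Doctor Lee walked one two kilometers
```

Production front ends read "12" as "twelve" and handle context (dates, currency, units); digit-by-digit spelling is used here only to keep the sketch short.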
2. Code Analysis

Key Code

1. TTS Initialization and Calling (largemodel/largemodel/model_service.py)

```python
# From largemodel/largemodel/model_service.py
class LargeModelService(Node):
    def __init__(self):
        # ...
        self.system_sound_init()
        # ...

    def init_param_config(self):
        # ...
        self.declare_parameter('useolinetts', False)
        self.useolinetts = self.get_parameter('useolinetts').get_parameter_value().bool_value
        if self.useolinetts:
            self.tts_out_path = os.path.join(self.pkg_path, "resources_file", "tts_output.mp3")
        else:
            self.tts_out_path = os.path.join(self.pkg_path, "resources_file", "tts_output.wav")

    def system_sound_init(self):
        """Initialize TTS system"""
        model_type = "oline" if self.useolinetts else "local"
        self.model_client.tts_model_init(model_type, self.language)
        self.get_logger().info(f'TTS initialized with {model_type} model')

    def _safe_play_audio(self, text_to_speak: str):
        """
        Synthesizes and plays all non-empty messages only in non-text chat mode.
        """
        if not self.text_chat_mode and text_to_speak:
            try:
                self.model_client.voice_synthesis(text_to_speak, self.tts_out_path)
                self.play_audio_async(self.tts_out_path)
            except Exception as e:
                self.get_logger().error(f"Safe audio playback failed: {e}")
```

2. TTS backend implementation (largemodel/utils/large_model_interface.py)

```python
# From largemodel/utils/large_model_interface.py
class model_interface:
    # ...
    def tts_model_init(self, model_type='oline', language='zh'):
        if model_type == 'oline':
            if self.tts_supplier == 'baidu':
                self.token = self.fetch_token()
            self.model_type = 'oline'
        elif model_type == 'local':
            self.model_type = 'local'
            if language == 'zh':
                tts_model = self.zh_tts_model
                tts_json = self.zh_tts_json
            elif language == 'en':
                tts_model = self.en_tts_model
                tts_json = self.en_tts_json
            self.synthesizer = piper.PiperVoice.load(tts_model, config_path=tts_json, use_cuda=False)

    def voice_synthesis(self, text, path):
        if self.model_type == 'oline':
            if self.tts_supplier == 'baidu':
                # ... (Baidu TTS implementation)
                pass
            elif self.tts_supplier == 'aliyun':
                # ... (Aliyun TTS implementation)
                pass
        elif self.model_type == 'local':
            with wave.open(path, 'wb') as wav_file:
                wav_file.setnchannels(1)
                wav_file.setsampwidth(2)
                wav_file.setframerate(self.synthesizer.config.sample_rate)
                self.synthesizer.synthesize(text, wav_file)
```

Code Analysis

The text-to-speech (TTS) function is invoked by the LargeModelService node and implemented by the model_interface class. A configuration parameter selects which backend service is used.
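The local branch of voice_synthesis relies on Python's standard `wave` module to produce a mono, 16-bit PCM file. The sketch below reproduces only that file-writing pattern, with a generated sine tone standing in for Piper output so that it runs without any model. `write_tone_wav` and its parameters are hypothetical names, and the 22050 Hz default is an assumed example value, not necessarily the sample rate of the models used here.

```python
import math
import struct
import wave


def write_tone_wav(path, sample_rate=22050, seconds=0.5, freq_hz=440.0):
    """Write a sine tone as mono 16-bit PCM and return the number of frames."""
    n_frames = int(sample_rate * seconds)
    samples = (
        int(32767 * 0.3 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n_frames)
    )
    # Pack each sample as a little-endian signed 16-bit integer.
    pcm = b"".join(struct.pack("<h", s) for s in samples)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)  # must match the audio data
        wav_file.writeframes(pcm)
    return n_frames


if __name__ == "__main__":
    frames = write_tone_wav("tone_test.wav")
    print(f"wrote {frames} frames")
```

If the frame rate written to the header does not match the rate of the audio data, playback sounds too fast or too slow, which is why the project code takes the rate from the loaded voice configuration.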
Initialization Process (model_service.py):

1. During LargeModelService initialization, the init_param_config function declares and reads the Boolean node parameter useolinetts (default False).
2. Based on the value of useolinetts, the system_sound_init function passes either the 'local' or the 'oline' string to the self.model_client.tts_model_init method.
3. In large_model_interface.py, the tts_model_init method runs the initialization logic that matches the string it receives. If the value is 'local', piper.PiperVoice.load is used to load the local model file.

Synthesis and Playback Process (model_service.py):

1. When there is text to speak and the node is not in text chat mode, the _safe_play_audio function is called.
2. It calls the self.model_client.voice_synthesis method, passing in the text to be converted and the target audio path self.tts_out_path.
3. After the voice_synthesis method completes and generates an audio file, _safe_play_audio calls self.play_audio_async to play the file asynchronously.

Backend Implementation Selection (large_model_interface.py):

1. The voice_synthesis method is the backend dispatch point for the TTS functionality. It selects the execution path by checking the self.model_type value set during initialization.
2. If self.model_type is 'local', the code uses Python's wave library to open a WAV file, sets its header parameters (channel count, sample width, and sampling rate), and then calls self.synthesizer.synthesize to write the synthesized audio directly to the file.
3. If self.model_type is 'oline', the branch for the configured cloud service provider (such as Baidu or Alibaba Cloud) is executed.

3. Practical Operations

3.1 Configuring Offline TTS

To enable offline TTS, configure yahboom.yaml and large_model_interface.yaml correctly and make sure the local model files are in place.
First, enter the Docker container in the terminal:
```bash
./ros2_docker.sh
```
If you need to enter the same Docker container and run other commands later, simply enter ./ros2_docker.sh again in the host terminal.
Open the main configuration file:
```bash
vim ~/yahboom_ws/src/largemodel/config/yahboom.yaml
```

Modify or confirm the following key configuration:

```yaml
model_service:                 # Model server node parameters
  ros__parameters:
    language: 'zh'             # Large Model Interface Language
    useolinetts: False         # Whether to use online speech synthesis (True to use online, False to use offline)
    regional_setting: "China"
```

useolinetts: Make sure this is set to False to use the local model.
language: Select "zh" for Chinese and "en" for English.
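How these two parameters steer the node can be summarized in a small standalone sketch. It mirrors the branching in init_param_config and tts_model_init from the code analysis, but `TtsSettings`, `resolve_tts_settings`, and the `resources_dir` argument are hypothetical names introduced for illustration; the real node reads the values through ROS 2 parameters.

```python
import os
from dataclasses import dataclass


@dataclass
class TtsSettings:
    model_type: str   # 'oline' or 'local', spelled as in the project code
    language: str     # 'zh' or 'en' for the local models
    output_path: str  # file the synthesized audio is written to


def resolve_tts_settings(useolinetts, language, resources_dir):
    """Map the two YAML parameters to a backend name and an output file."""
    if useolinetts:
        # The online backends write MP3 in the project code.
        path = os.path.join(resources_dir, "tts_output.mp3")
        return TtsSettings("oline", language, path)
    if language not in ("zh", "en"):
        # Assumption of this sketch: fail early rather than at model loading.
        raise ValueError(f"no local TTS model for language {language!r}")
    # The local Piper backend writes WAV.
    path = os.path.join(resources_dir, "tts_output.wav")
    return TtsSettings("local", language, path)


settings = resolve_tts_settings(False, "zh", "resources_file")
print(settings.model_type, os.path.basename(settings.output_path))
# prints: local tts_output.wav
```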
Open the model interface configuration file:
```bash
vim ~/yahboom_ws/src/largemodel/config/large_model_interface.yaml
```

Confirm the offline model path:

```yaml
# large_model_interface.yaml

## Offline speech synthesis (Offline TTS)
# Chinese TTS model
zh_tts_model: "/root/yahboom_ws/src/largemodel/MODELS/tts/zh/zh_CN-huayan-medium.onnx"
zh_tts_json: "/root/yahboom_ws/src/largemodel/MODELS/tts/zh/zh_CN-huayan-medium.onnx.json"
# English TTS model
en_tts_model: "/root/yahboom_ws/src/largemodel/MODELS/tts/en/en_US-libritts-high.onnx"
en_tts_json: "/root/yahboom_ws/src/largemodel/MODELS/tts/en/en_US-libritts-high.onnx.json"
```

3.2 Starting and Testing the Functionality

Start the TTS node: Run the following command:
```bash
ros2 launch largemodel tts_only.launch.py
```

Send the text to be synthesized: Open a new terminal and run the following command to publish the text to the /tts_text_input topic:

```bash
ros2 topic pub --once /tts_text_input std_msgs/msg/String '{data: "语音合成测试成功"}'
```

Test: If everything works correctly, you should hear the robot speak the published Chinese sentence (which means "speech synthesis test successful") in a synthesized voice through your speakers.
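To separate synthesis from playback, it can help to check whether the generated file actually contains audio. The sketch below reads a WAV header with Python's standard `wave` module. `describe_wav` is a hypothetical helper, and the demo runs on a short silent file it creates itself; in practice it would be pointed at the tts_output.wav file in the package's resources_file directory, whose full path depends on the installation.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return basic header facts for a WAV file, including its duration."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_width_bytes": wav_file.getsampwidth(),
            "sample_rate": rate,
            "frames": frames,
            "duration_s": frames / rate if rate else 0.0,
        }


if __name__ == "__main__":
    # Demo input: 0.1 s of silence, mono, 16-bit, 16 kHz.
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
    with wave.open(demo_path, "wb") as demo:
        demo.setnchannels(1)
        demo.setsampwidth(2)
        demo.setframerate(16000)
        demo.writeframes(b"\x00\x00" * 1600)
    print(describe_wav(demo_path))
```

A frame count of zero, or a missing file, points at the synthesis step; a file with a plausible duration points at the playback device or volume settings instead.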
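4. Common Problems and Solutions

4.1 Playback Issues

Issue 1: The program runs normally without errors, but no sound is heard.

Solution: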