Embodied Intelligent Gameplay Core Source Code Interpretation

1. Course Content

  1. Embodied intelligence gameplay with large AI models is a complex feature that relies on several node programs working together. This course explains the core code.

2. Source Code Package Structure

2.1 Package File Structure

The ROS package for AI large model embodied intelligence is named largemodel. The package path is:

Jetson Orin Nano, Jetson Orin NX Host:

Jetson Nano, Raspberry Pi Host:

You need to first enter Docker.

The package file structure is as follows:

[Figure: package file structure]

Folder and file functions are explained below:

2.1.1 config

Configuration folder, storing configuration files

Large model interface configuration file, used to configure the API keys and large model parameters for each platform.

Map mapping file, used to map the raster map to real-world regions.

Node configuration file: used to configure core node parameters.

2.1.2 largemodel

Source code folder, containing the core programs for AI large model embodied intelligence gameplay.

Speech recognition program file, used to run the speech recognition model and interact with the user.

Model server program file, used to call various model interfaces to implement the model inference architecture.

Action server program file, used to receive action lists requested by the model server, control robot movements, and play sounds.

2.1.3 launch

Startup file folder, used to store ROS2 node startup files.

The startup file for AI large model embodied intelligence gameplay. It starts multiple nodes and supports two startup modes: voice interaction mode and text interaction mode.

2.1.4 resources_file

Stores audio files for system sounds

2.1.5 utils

Component folder, stores program files for non-core functions

The large model interface program file contains the underlying interface for calling the large model on various platforms. The calling procedures vary between different platforms and models, so the interface file manages the calling methods uniformly.

The voice module driver file is used to drive the voice module's wake-up word function.

The large model prompt word file: used to generate prompt words for the execution layer large model.

The Dify API function file: used by the international version to send requests to the local Dify application.

2.2 Inter-program call diagram

[Figure: inter-program call diagram (Source code analysis.drawio)]

3. Speech Recognition Function

The speech recognition function includes two parts: VAD voice activity detection and speech-to-text conversion. Source code path:

Jetson Orin Nano, Jetson Orin NX host:

Jetson Nano, Raspberry Pi host:

You need to enter Docker first.

3.1 Program Flowchart

[Flowchart: ASR program. Nodes: Start; Initialize ROS2 Node; Initialize Parameters; Initialize LLM Interface; Initialize Wake Word Detection; Initialize ASR Model; Initialize Language Settings; Initialize WebRTC VAD; Create Service Clients and Publishers; Print Initialization Info; Is Audio Request Queue Empty?; Get Audio Request; Send Initial Response; Start VAD & Recording; Wait for User Speech; Perform ASR; Is ASR Result Valid?; Publish ASR Result; Publish Error Response; ROS2 Spin Once; Sleep 0.1s; Destroy Node; Shutdown ROS2; End. Edge labels: Yes, No, Ctrl+C.]

 

3.2 VAD Voice Activity Detection

Implementation: The listen_for_speech method in the ASRNode class

Program Explanation: Records real-time audio from a specified microphone and uses VAD (Voice Activity Detection) to decide whether each frame contains speech. When the end of a speech segment is detected (continuous silence exceeds a set number of frames), recording stops and the valid speech content is saved to a file.

Detailed logic is as follows:

  1. Initialize the audio stream and configure parameters (such as sampling rate and number of channels).
  2. Continuously read audio frames and perform voice activity detection.
  3. If speech starts, add the audio frame to the buffer. If continuous silence exceeds a threshold (90 frames, approximately 1 second), recording ends.
  4. After recording is complete, remove the trailing silence and save the valid speech content as a WAV file.
  5. If no valid speech is detected, the file is not saved.
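
The recording logic above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the listen_for_speech source: the per-frame decision is passed in as an is_speech callable (in a real node it would probably wrap the WebRTC VAD), and the names collect_speech and save_if_speech, the frame source, and the 16 kHz mono 16-bit WAV format are all hypothetical choices made for illustration.

```python
import wave
from typing import Callable, Iterable, List


def collect_speech(frames: Iterable[bytes],
                   is_speech: Callable[[bytes], bool],
                   max_silence_frames: int = 90) -> List[bytes]:
    """Collect one utterance and return it without trailing silence.

    Returns an empty list when no speech was detected at all.
    """
    buffer: List[bytes] = []
    started = False
    silence_run = 0  # consecutive silent frames since the last voiced frame

    for frame in frames:
        if is_speech(frame):
            started = True
            silence_run = 0
            buffer.append(frame)
        elif started:
            # Silence inside an utterance is kept, but counted.
            silence_run += 1
            buffer.append(frame)
            if silence_run > max_silence_frames:
                break  # long pause: the utterance is over

    # Drop the silent frames that accumulated at the end of the buffer.
    if silence_run:
        buffer = buffer[:-silence_run]
    return buffer


def save_if_speech(frames: List[bytes], path: str,
                   sample_rate: int = 16000) -> bool:
    """Write frames to a WAV file; write nothing when there is no speech."""
    if not frames:
        return False
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # assumed mono
        wav.setsampwidth(2)        # assumed 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(frames))
    return True
```

Passing the VAD decision in as a callable keeps the end-of-speech bookkeeping separate from the audio driver, so the silence threshold can be tried out on synthetic frame sequences without a microphone.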

3.3 ASR Speech Recognition

Implementation: The ASR_conversion method in the ASRNode class converts the recording file into text.

Call the speech recognition model interface function in the large_model_interface.py large model interface file.

Note: Recognized text shorter than 4 characters is treated as invalid. This prevents sounds picked up after a false wake-up from being mistaken for a valid command.

Program Explanation:

  1. If using oline_asr, call the corresponding method and check whether the result is valid text (length greater than 4). If successful, return the recognized content; otherwise, log the error and return 'error'.
  2. Otherwise, use SenseVoiceSmall_ASR for recognition and perform the same result judgment and processing.

SenseVoiceSmall_ASR is a local model speech recognition method and is only available on Jetson Orin Nano and Jetson Orin NX hosts.
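
The result check described above might look like the following. It is a hedged sketch rather than the ASR_conversion source: the recognize callable stands in for whichever backend is selected (online or SenseVoiceSmall_ASR), its (ok, text) return shape is an assumption, and the exact length comparison (here: at least min_chars characters) is also an assumption.

```python
import logging
from typing import Callable, Tuple

logger = logging.getLogger("asr_sketch")


def asr_conversion(recognize: Callable[[str], Tuple[bool, str]],
                   audio_path: str,
                   min_chars: int = 4) -> str:
    """Return the recognized text, or 'error' when the result is unusable."""
    ok, text = recognize(audio_path)  # hypothetical backend signature
    if ok and len(text) >= min_chars:
        return text
    # Either the backend reported a failure or the text is too short
    # to be a real command (for example after a false wake-up).
    logger.error("ASR failed or text too short: %r", text)
    return "error"
```

Because both backends go through the same check, the caller only has to compare the return value with 'error' to decide between publishing the ASR result and publishing an error response.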

 

4. Model Server Functionality

The implementation program is model_service.py, which receives voice recognition results in voice interaction mode or terminal input in text interaction mode, and implements the large model inference logic. Source code path:

Jetson Orin Nano, Jetson Orin NX host:

Jetson Nano, Raspberry Pi host:

You need to enter Docker first.

4.1 Program Flowchart

[Flowchart: model_service program. Nodes: Start; Initialize Parameters; Initialize LLM; Initialize System Audio; Initialize ROS; Create Subscribers & Services; Received Message or Request?; Message Type; Play Audio Callback; Does Audio File Exist?; Synthesize & Play Audio; Play Audio File; Is Message 'finish'?; Send Feedback to Executor; Start New Instruction Cycle; TTS Callback; TTS Successful?; Log Error; Play TTS Audio; Dual-Model Mode?; Single-Model Inference; Extract Action & Reply; Send Action to Service; Dual-Model Inference. Edge labels: Yes, No, Audio Request, Action Status, TTS Message, ASR Message, Seehat Message.]

 

4.2 Dual-Model Inference (Domestic Version)

Dual-model inference mode is used by default. It is implemented by the dual_large_model_mode and instruction_process methods of the LargeModelService class.

Program Explanation:

  1. If this is a new instruction cycle (self.new_order_cycle is True):
  2. Otherwise (not a new cycle):

 

4.3 Dual-Model Inference (International Version)

  1. The implementation logic is the same as in the domestic version. The difference is that the international version sends its request to the local Dify application, which in turn calls the cloud large model.
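
As an illustration of the extra hop through Dify, the sketch below only builds the HTTP request and does not send it. It assumes a chat-type Dify app reached through Dify's chat-messages endpoint; the base URL, the API key, the user name, and the function name build_dify_request are hypothetical, and the package's own Dify helper may be organized differently.

```python
import json
import urllib.request


def build_dify_request(query: str, api_key: str,
                       base_url: str = "http://localhost/v1",
                       user: str = "robot",
                       conversation_id: str = "") -> urllib.request.Request:
    """Build (but do not send) a blocking chat request for a local Dify app."""
    body = {
        "inputs": {},
        "query": query,
        "response_mode": "blocking",
        # An empty id asks Dify to start a new conversation.
        "conversation_id": conversation_id,
        "user": user,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat-messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The request would then be sent with urllib.request.urlopen, and the reply text read from the JSON response. Keeping request construction separate from sending makes the payload easy to inspect without a running Dify instance.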

 

5. Action Server Functionality

The implementation program is action_service.py, which receives the action list requested by the model server, parses the action list, and executes it. Source code path:

Jetson Orin Nano, Jetson Orin NX host:

Jetson Nano, Raspberry Pi host:

You need to enter Docker first.

5.1 Program Flowchart

[Flowchart: action_service program. Nodes: Start; Initialize Node; Wait for Action List Request; Is the Action List Empty?; Send Voice Request; Return Success Info; End; Is Action List Length 1?; Parse and Execute Single Action; Was Action Successful?; Enable Combo Mode; Parse and Execute Each Action in Order; Publish Error Info and Abort Task; Publish Feedback Info; Stop Robot; Reset Combo Mode Flag; Publish Completion Info; Return Success Info. Edge labels: Yes, No.]

 

5.2 Action Function Library

The robot's action functions are implemented as methods of the CustomActionServer class, which contains one function per sub-action. Here, we use the action function that controls the robot's chassis movement as an example. Other action functions are explained in later chapters, where they first appear.

Program Explanation:

  1. Convert the input string parameter to a floating-point number.
  2. Create and set a Twist-type motion command.
  3. Publish the corresponding speed topic and stop when finished.
  4. Determine whether to publish a completion message based on the mode.
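
The four steps above can be sketched without a ROS installation. Everything named here is a stand-in: Twist is a flat substitute for geometry_msgs.msg.Twist (the real message nests linear.x and angular.z), RecordingPublisher replaces a ROS 2 publisher, and move_chassis, its parameters, and the completion string are hypothetical. The sketch also assumes that the completion message is skipped in combined mode, since the caller reports completion once for the whole list.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Twist:
    """Flat stand-in for geometry_msgs.msg.Twist (only the fields used here)."""
    linear_x: float = 0.0
    angular_z: float = 0.0


class RecordingPublisher:
    """Stand-in for a ROS 2 publisher; it only records what is published."""

    def __init__(self) -> None:
        self.sent: List[object] = []

    def publish(self, msg: object) -> None:
        self.sent.append(msg)


def move_chassis(cmd_pub, status_pub,
                 linear_x: str, angular_z: str, duration: str,
                 combination_mode: bool = False,
                 sleep: Callable[[float], None] = time.sleep) -> None:
    # 1. Arguments arrive as strings from the parsed action list.
    vx, wz, seconds = float(linear_x), float(angular_z), float(duration)

    # 2. Build the motion command.
    twist = Twist(linear_x=vx, angular_z=wz)

    # 3. Publish it, hold for the requested time, then stop the chassis.
    cmd_pub.publish(twist)
    sleep(seconds)
    cmd_pub.publish(Twist())

    # 4. Assumed rule: a single action reports completion itself, while in
    #    combined mode the caller reports once after the whole list.
    if not combination_mode:
        status_pub.publish("move_chassis done")  # hypothetical message text
```

Injecting sleep lets the timing be replaced by a no-op when trying the function out, for example move_chassis(cmd, status, "0.2", "0.0", "1.5", sleep=lambda s: None).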

5.3 Parsing the Action List to Control Robot Functions

Parse the action list generated by the large model into functions that control the robot entity and execute them. This is implemented in the execute_callback method of the CustomActionServer class.

Program Explanation:

  1. Receive the action list (string format) sent by the client;
  2. If the action list is empty, return a success result;
  3. If there is only one action, parse and execute the corresponding method; abort if failure occurs;
  4. If there are multiple actions, execute them sequentially in combined mode, logging and providing feedback as they occur;
  5. After all actions are executed, call stop() to stop the robot;
  6. Finally, return a successful result to the client.
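
The control flow in these six steps can be sketched as follows. This is not the execute_callback source: it assumes the action list has already been split into strings of the form name(arg1, arg2), and ActionRunner, parse_action, and DemoRobot are hypothetical names used only for illustration.

```python
import re
from typing import List, Tuple

ACTION_RE = re.compile(r"^\s*(\w+)\((.*)\)\s*$")


def parse_action(action: str) -> Tuple[str, List[str]]:
    """Split 'name(a, b)' into ('name', ['a', 'b'])."""
    match = ACTION_RE.match(action)
    if match is None:
        raise ValueError(f"malformed action: {action!r}")
    name, raw_args = match.groups()
    return name, [a.strip() for a in raw_args.split(",") if a.strip()]


class DemoRobot:
    """Stand-in for the action function library; it records calls."""

    def __init__(self) -> None:
        self.calls: List[tuple] = []

    def move(self, speed: str, seconds: str) -> None:
        self.calls.append(("move", speed, seconds))

    def stop(self) -> None:
        self.calls.append(("stop",))


class ActionRunner:
    def __init__(self, robot) -> None:
        self.robot = robot
        self.combination_mode = False
        self.log: List[str] = []

    def _run_one(self, action: str) -> bool:
        try:
            name, args = parse_action(action)
            getattr(self.robot, name)(*args)  # dispatch by method name
            return True
        except Exception as exc:  # unknown action, bad arguments, ...
            self.log.append(f"failed: {action} ({exc})")
            return False

    def execute(self, actions: List[str]) -> bool:
        if not actions:                      # empty list: nothing to do
            return True
        if len(actions) == 1:                # single action
            if not self._run_one(actions[0]):
                return False                 # abort on failure
        else:                                # several actions: combined mode
            self.combination_mode = True
            try:
                for action in actions:
                    if not self._run_one(action):
                        return False
                    self.log.append(f"done: {action}")
            finally:
                self.combination_mode = False  # always reset the flag
        self.robot.stop()                    # stop after all actions ran
        return True
```

Dispatching through getattr keeps the parser independent of the set of action functions, so adding a new action only means adding a method to the robot class.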

6. Interruption Functionality

The robot supports interruptions at any stage, specifically during recording, conversation, and action. The principles for interruption in each stage are explained here.

6.1 Interruption During Recording

If you make a mistake while recording, or are dissatisfied with what you said and want to re-record, say the wake word again. This interrupts the current recording and starts a new one.

 

6.2 Interrupting the Conversation Phase

If you're dissatisfied with the robot's response or don't want it to continue, you can interrupt it with the wake word, which starts a new recording. At this point, you can give the robot new commands (while still in the current task cycle), or say "End current task" to end the current task and start a new one.

When play_audio plays audio, it will detect whether self.stop_event is set. Once it detects that it is set, it will stop the currently playing audio immediately.
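
A minimal sketch of this check, assuming audio is played chunk by chunk: the stop event is tested before each chunk, so playback ends within one chunk of the event being set. The chunk source and the write_chunk callable are hypothetical stand-ins for the real audio backend, and this play_audio is an illustration, not the source method.

```python
import threading
from typing import Callable, Iterable


def play_audio(chunks: Iterable[bytes],
               write_chunk: Callable[[bytes], None],
               stop_event: threading.Event) -> bool:
    """Play chunks in order; return False if playback was interrupted."""
    for chunk in chunks:
        if stop_event.is_set():
            return False  # wake word detected elsewhere: stop right away
        write_chunk(chunk)
    return True
```

Another thread, such as the wake-word handler, only needs to call stop_event.set() to cut playback short, and stop_event.clear() before the next playback starts.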

 

6.3 Action Phase Interruption

If the robot is woken up during an action, it stops the current action and returns to its initial posture. Two cases are distinguished: standard action interruption, and interruption of actions that run a subprocess.

6.3.1 Standard Action Interruption

The implementation is in the _execute_action and pubSix_Arm methods in the CustomActionServer class in action_service.py.

6.3.2 Interrupting Actions with Subprocesses

Some actions, such as robotic-arm grasping and machine-code sorting, launch an external program in a subprocess. Here, we use the grasping action function grasp_obj as an example:

While the grasp has not completed, the function waits in a while not self.grasp_obj_future.done(): loop. If the interrupt flag self.interrupt_flag is set during this wait, it first calls self.kill_process_tree to recursively terminate the subprocess tree, and then terminates the action.
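
The wait-and-interrupt pattern can be sketched with the standard library alone. Note the substitutions: instead of waiting on a future and walking the process tree recursively as kill_process_tree is described to do, this sketch polls a subprocess.Popen object and starts the child in its own POSIX process group, so one signal reaches the child and its descendants. The names launch and wait_or_interrupt are hypothetical, and the approach is POSIX-only.

```python
import os
import signal
import subprocess
import threading
import time
from typing import List


def launch(cmd: List[str]) -> subprocess.Popen:
    """Start cmd in a new session so it leads its own process group."""
    return subprocess.Popen(cmd, start_new_session=True)


def wait_or_interrupt(proc: subprocess.Popen,
                      interrupt_flag: threading.Event,
                      poll_interval: float = 0.05) -> bool:
    """Wait for proc; return False if it was terminated by an interrupt."""
    while proc.poll() is None:
        if interrupt_flag.is_set():
            # Signal the whole group so grandchildren are stopped as well.
            os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
            proc.wait(timeout=5)
            return False
        time.sleep(poll_interval)
    return True
```

A process group is simpler than a recursive walk, but it misses descendants that move to a different session, which is one reason a tree-based helper may be preferred in practice.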