Core Source Code for Embodied Intelligence Functions

1. Course Content

  1. The embodied AI functions built on large language models are complex and rely on several program nodes working together. This section of the course explains the core programs involved.

2. Source Code Package Structure

2.1 Function Package File Structure

The ROS function package for the large-model embodied intelligence features is largemodel. The package path is:

Jetson Orin Nano, Jetson Orin NX host:

Jetson Nano host:

Requires entering the Docker container first:

The function package file structure is as follows:

[Figure: file structure of the largemodel function package]

Description of folders and files:

2.1.1 config

Configuration folder, stores configuration files

Large model interface configuration file, used to configure API keys and large model parameters for various platforms.

Node configuration file: used to configure parameters for core nodes.

2.1.2 largemodel

Source code folder, core program for AI large model embodied intelligence gameplay

Speech recognition program file, used to run the speech recognition model and interact with the user.

Model server program file, used to call various model interfaces to implement the model inference architecture.

Action server program file, used to receive the action list requested by the model server, control the robot's movement and play sounds.

2.1.3 launch

Launch file folder, used to store ROS2 node launch files

Launch file for AI large model embodied intelligence gameplay: starts multiple nodes, with two startup methods: voice interaction mode and text interaction mode.

2.1.4 resources_file

Stores audio files for system sounds.

2.1.5 utils

Component folder, stores program files for non-core functions

Large model interface program file, contains the underlying interfaces for calling large models from various platforms. Because each platform and model is called differently, the calling methods are managed centrally in this interface file.

Voice module driver program file, used to drive the wake-up word function of the voice module.

Dify API functions, used by the international version to send requests to the local Dify application.

2.2 Inter-program Call Relationship Diagram

[Figure: inter-program call relationship diagram (source code analysis)]

 

3. Speech Recognition Function

The speech recognition function includes two parts: VAD (Voice Activity Detection) and speech-to-text conversion. The source code path is:

Jetson Orin Nano, Jetson Orin NX host:

Jetson Nano host:

Requires entering the Docker container first:

3.1 Program Flowchart

asr program flowchart (figure). The nodes shown are:

  Initialization: Start, Initialize ROS2 Node, Initialize Parameters, Initialize LLM Interface, Initialize Wake Word Detection, Initialize ASR Model, Initialize Language Settings, Initialize WebRTC VAD, Create Service Clients and Publishers, Print Initialization Info.
  Main loop: Is Audio Request Queue Empty? (Yes / No), Get Audio Request, Send Initial Response, Wait for User Speech, Start VAD & Recording, Perform ASR, Is ASR Result Valid? (Yes / No), Publish ASR Result, Publish Error Response, ROS2 Spin Once, Sleep 0.1s.
  Shutdown: Ctrl+C, Destroy Node, Shutdown ROS2, End.

3.2 VAD Voice Activity Detection

Implementation Program: The listen_for_speech method in the ASRNode class.

Program Description: The program uses a specified microphone to record audio in real time and uses VAD (Voice Activity Detection) to determine if someone is speaking. When a segment of speech is detected to have ended (continuous silence exceeding a set number of frames), recording stops and the valid speech content is saved to a file.

Detailed logic is as follows:

  1. Initialize the audio stream and configure parameters (such as sampling rate, number of channels, etc.).
  2. Continuously read audio frames and perform voice activity detection.
  3. If speech is detected, add the audio frames to the buffer; if continuous silence exceeding the threshold (90 frames, approximately 1 second) is detected, end the recording.
  4. After recording ends, remove the trailing silence and save the valid speech as a WAV file.
  5. If no valid speech is detected, no file is saved.
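The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the listen_for_speech source: collect_speech, save_wav, and the is_speech callback are hypothetical names, and the callback stands in for whatever detector the node uses (the flowchart mentions WebRTC VAD). Frames are assumed to be raw 16-bit mono PCM chunks, and the microphone stream is replaced by any iterable of frames.

```python
import wave
from typing import Callable, Iterable, List


def collect_speech(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    max_silence_frames: int = 90,
) -> List[bytes]:
    """Buffer frames from the first voiced frame until silence exceeds the threshold."""
    buffered: List[bytes] = []
    silence_run = 0   # consecutive silent frames since the last voiced frame
    started = False   # becomes True once any speech has been heard

    for frame in frames:
        if is_speech(frame):
            started = True
            silence_run = 0
            buffered.append(frame)
        elif started:
            silence_run += 1
            buffered.append(frame)
            if silence_run > max_silence_frames:
                break  # the speaker has stopped talking

    if not started:
        return []  # no valid speech: nothing to save
    if silence_run:
        buffered = buffered[:-silence_run]  # drop trailing silence
    return buffered


def save_wav(path: str, frames: List[bytes], sample_rate: int = 16000) -> bool:
    """Write frames as 16-bit mono WAV; return False when there is nothing to save."""
    if not frames:
        return False
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(b"".join(frames))
    return True
```

In this sketch, leading silence is never buffered, so only the trailing silence needs trimming. Whether the real method also keeps a short pre-roll before the first voiced frame is not stated in this section.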

3.3 ASR Speech Recognition

Implementation: The ASR_conversion method in the ASRNode class converts the audio file into text.

It calls the speech recognition model interface function in the large_model_interface.py file.

Note: Recognized text of 4 characters or fewer is treated as invalid. This prevents misrecognized content from being treated as a valid command.

Program Explanation:

  1. If using online_asr, the corresponding method is called, and the result is checked to see if it is valid text (length greater than 4). If successful, the recognized content is returned; otherwise, an error is logged, and 'error' is returned.
  2. Otherwise, SenseVoiceSmall_ASR is used for recognition, and the same result judgment and processing are performed.

Note that SenseVoiceSmall_ASR is a local model speech recognition method and can only be used on Jetson Orin Nano and Jetson Orin NX hosts.
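The validity check described above can be sketched as follows. This is a minimal sketch, not the ASR_conversion source: the recognize callback and its (ok, text) return shape are assumptions made for illustration, and the real interface functions in large_model_interface.py may return results in a different form. The threshold follows the "length greater than 4" rule from the program explanation.

```python
import logging
from typing import Callable, Tuple

logger = logging.getLogger("asr_sketch")


def asr_conversion(
    recognize: Callable[[str], Tuple[bool, str]],
    audio_path: str,
    min_len: int = 4,
) -> str:
    """Return the recognized text, or 'error' if recognition failed or is too short.

    `recognize` is a hypothetical stand-in for either the online or the local
    recognizer; it is assumed to return (ok, text_or_error_message).
    """
    ok, text = recognize(audio_path)
    if ok and len(text) > min_len:
        return text
    logger.error("ASR error: %r", text)
    return "error"
```

Because both recognizers are handled identically after the call, passing the recognizer in as a parameter keeps the check in one place.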

 

4. Model Server Functionality

The program is implemented in model_service.py, which receives speech recognition results in voice interaction mode or terminal input in text interaction mode, and implements the large language model inference logic. The source code path is:

Jetson Orin Nano, Jetson Orin NX host:

Jetson Nano host:

Requires entering the Docker container first:

4.1 Program Flowchart

model_service program flowchart (figure). The nodes shown are:

  Initialization: Start, Initialize Parameters, Initialize LLM, Initialize System Audio, Initialize ROS, Create Subscribers & Services.
  Dispatch: Received Message or Request? (Yes / No), Message Type (Audio Request, Action Status, TTS Message, ASR Message, Seehat Message).
  Audio request: Play Audio Callback, Does Audio File Exist? (Yes / No), Play Audio File, Synthesize & Play Audio.
  Action status: Is Message 'finish'? (Yes / No), Send Feedback to Executor, Start New Instruction Cycle.
  TTS message: TTS Callback, TTS Successful? (Yes / No), Play TTS Audio, Log Error.
  ASR and seehat messages: Dual-Model Mode? (Yes / No), Dual-Model Inference, Single-Model Inference, Extract Action & Reply, Send Action to Service.

 

4.2 Dual-Model Inference (Domestic Version)

Dual-model inference mode is used by default. It is implemented in the dual_large_model_mode and instruction_process methods of the LargeModelService class.

Program interpretation:

  1. If it's a new instruction cycle (self.new_order_cycle is True):
  2. Otherwise (not a new cycle):

 

4.3 Dual-Model Inference (International Version)

  1. The implementation logic is the same as in the domestic version. The difference is that the international version sends requests to the local Dify application, which in turn requests the cloud-based large language model.

 

5. Action Server Functionality

The program implementing this functionality is action_service.py, which receives the list of actions requested by the model server, parses the action list, and executes them. Source code path:

Jetson Orin Nano, Jetson Orin NX host:

Jetson Nano host:

Requires entering the Docker container first:

5.1 Program Flowchart

action_service program flowchart (figure). The nodes shown are:

  Start, Initialize Node, Wait for Action List Request, Is the Action List Empty? (Yes / No), Is Action List Length 1? (Yes / No), Send Voice Request, Parse and Execute Single Action, Was Action Successful? (Yes / No), Enable Combo Mode, Parse and Execute Each Action in Order, Publish Error Info and Abort Task, Publish Feedback Info, Stop Robot, Reset Combo Mode Flag, Publish Completion Info, Return Success Info, End.

5.2 Action Function Library

The robot's action functions come from the methods in the CustomActionServer class, which includes functions for various sub-actions. This section uses the function for controlling the robotic arm to move upwards as an example; other action functions will be explained when they first appear in subsequent chapters.

Program explanation:

  1. Convert the input string parameter to a floating-point number.
  2. Directly call the function, passing the parameter, and communicate with the underlying driver board to control the six servos.
  3. Determine whether to publish the completion message based on the mode.
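The three steps above can be sketched as follows. This is a minimal sketch, not the CustomActionServer source: FakeArmDriver, ArmActions, arm_up, move_up, and combination_mode are hypothetical names, and the driver is a stand-in that only records commands. The rule used in step 3 (publish the completion message only outside combined mode) is an assumption based on the flowchart, where combined mode publishes completion once after the whole list. The 'finish' payload is taken from the check shown in the model_service flowchart.

```python
from typing import Callable, List


class FakeArmDriver:
    """Stand-in for the driver-board interface (hypothetical)."""

    def __init__(self) -> None:
        self.commands: List[float] = []

    def move_up(self, distance: float) -> None:
        # A real driver would compute and send targets for the six servos here.
        self.commands.append(distance)


class ArmActions:
    """Holds one example action method in the style described above."""

    def __init__(self, driver: FakeArmDriver, publish: Callable[[str], None]) -> None:
        self.driver = driver
        self.publish = publish
        self.combination_mode = False  # set by the caller for multi-action lists

    def arm_up(self, distance: str) -> None:
        value = float(distance)        # step 1: string parameter to float
        self.driver.move_up(value)     # step 2: hand the value to the driver
        if not self.combination_mode:  # step 3: publish only for a single action
            self.publish("finish")
```

Parameters arrive as strings because the action list itself is text generated by the model, so each action method is responsible for its own conversion.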

5.3 Parsing the Action List to Control Robot Functionality

This section describes how the action list generated by the large language model is parsed into function calls that control the physical robot and then executed. The implementation is found in the execute_callback method of the CustomActionServer class.

Program Description:

  1. Receive the action list (in string format) sent by the client.
  2. If the action list is empty, return a success result immediately.
  3. If there is only one action, parse it and execute the corresponding method; if execution fails, terminate the process.
  4. If there are multiple actions, enter combined mode and execute them in order, logging and publishing feedback along the way.
  5. After all actions are complete, call stop() to stop the robot.
  6. Finally, return the successful execution result to the client.
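The control flow above can be sketched as follows. This is a minimal sketch, not the execute_callback source: the action string format name(arg, ...) is an assumption, since this section does not show what the model actually emits, and ActionExecutor, arm_up, and the log list are hypothetical. ROS communication is left out, and a plain list stands in for logging and feedback.

```python
import re
from typing import List

# Assumed action syntax: a method name followed by comma-separated arguments.
ACTION_RE = re.compile(r"^\s*(\w+)\((.*)\)\s*$")


class ActionExecutor:
    def __init__(self) -> None:
        self.combination_mode = False
        self.log: List[str] = []  # stands in for logging and feedback messages

    # Hypothetical action methods.
    def arm_up(self, distance: str) -> None:
        self.log.append(f"arm_up {float(distance)}")

    def stop(self) -> None:
        self.log.append("stop")

    def _run(self, action: str) -> bool:
        """Parse one action string and call the matching method by name."""
        match = ACTION_RE.match(action)
        if match is None:
            return False
        name, raw_args = match.groups()
        method = getattr(self, name, None)
        if name.startswith("_") or not callable(method):
            return False
        args = [a.strip() for a in raw_args.split(",") if a.strip()]
        try:
            method(*args)
        except (TypeError, ValueError):
            return False
        return True

    def execute(self, actions: List[str]) -> bool:
        if not actions:
            return True                    # empty list: succeed immediately
        if len(actions) == 1:
            if not self._run(actions[0]):
                return False               # single action failed: abort
        else:
            self.combination_mode = True   # multiple actions: combined mode
            try:
                for action in actions:
                    if not self._run(action):
                        return False
            finally:
                self.combination_mode = False  # always reset the flag
        self.stop()                        # stop the robot after all actions
        return True
```

Looking methods up by name keeps the dispatch table in sync with the class, and the regular expression avoids evaluating model output as code.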

6. Interruption Function

The robot supports interruptions at any stage, which can be divided into recording stage interruptions, dialogue stage interruptions, and action stage interruptions. This section introduces the principles of interruption at each stage.

6.1 Recording Stage Interruption

If you misspeak during recording or are dissatisfied with what was recorded, wake the robot again while it is still recording. This interrupts the previous recording and starts a new one.

 

6.2 Dialogue Stage Interruption

If you are dissatisfied with the robot's response during its speech or don't want the robot to continue speaking, you can use the wake word to interrupt the robot's speech and start recording your voice. At this point, you can give the robot a new command (still within the current task cycle), or you can say "End current task" to directly end the current task and start a new task cycle.

When playing audio, the system checks if self.stop_event has been set. If it detects that it has been set, it immediately stops the currently playing audio.
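The stop_event check can be sketched as follows. This is a minimal sketch under assumptions: play_chunks and write_chunk are hypothetical names, the audio is assumed to be played in small chunks, and the event is a standard threading.Event. How the real program splits and outputs audio is not shown in this section.

```python
import threading
from typing import Callable, Iterable


def play_chunks(
    chunks: Iterable[bytes],
    stop_event: threading.Event,
    write_chunk: Callable[[bytes], None],
) -> int:
    """Play audio chunk by chunk, stopping as soon as stop_event is set.

    Returns the number of chunks actually played.
    """
    played = 0
    for chunk in chunks:
        if stop_event.is_set():
            break  # wake word detected elsewhere: stop speaking immediately
        write_chunk(chunk)  # hypothetical output call, e.g. a stream write
        played += 1
    return played
```

The smaller the chunks, the sooner playback reacts to the event, since the check only happens between chunks.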

 

6.3 Action Stage Interruption

If the robot is interrupted during the execution of an action, it will stop the current action and return to its initial posture.

For example, actions such as robotic arm grasping and sorting require launching external programs in a subprocess. Here, we use the color block handling function change_pose as an example:

While the color block handling is still running, the program waits in the while not self.change_pose_future.done(): loop. If self.interrupt_flag is found to be set during this wait, the check_close_change_pose function is called to recursively terminate the subprocess tree, the robotic arm is returned to the grasping posture, and the task is terminated.
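The waiting loop described above can be sketched as follows. This is a minimal sketch, not the change_pose source: wait_for_action and its two callbacks are hypothetical, a concurrent.futures.Future stands in for the future used by the node, and a threading.Event stands in for the interrupt flag. The callbacks represent terminating the subprocess tree and returning the arm to its posture.

```python
import threading
import time
from concurrent.futures import Future
from typing import Callable


def wait_for_action(
    future: Future,
    interrupt_flag: threading.Event,
    terminate_tree: Callable[[], None],
    return_to_pose: Callable[[], None],
    poll_s: float = 0.05,
) -> bool:
    """Wait for the action to finish; return False if it was interrupted."""
    while not future.done():
        if interrupt_flag.is_set():
            terminate_tree()   # stop the external program and its children
            return_to_pose()   # bring the arm back to a known posture
            return False       # the task ends here
        time.sleep(poll_s)     # avoid a busy loop while waiting
    return True
```

Polling with a short sleep keeps the loop responsive to the flag without consuming a full CPU core.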