B̂ to balance exploration and exploitation. For example, when obstacles are detected, B̂ is set to 1 for pure exploitation. Compared with standard Q-learning, the probability of collision between the AUV and obstacles is reduced.
Based on hierarchical reinforcement learning, Sun et al. (2020) designed a Hierarchical Deep Q Network for AUV path planning. Obstacle avoidance and target approaching are set as two subtasks with different selection strategies. In addition, combining the Hierarchical Deep Q Network with prioritized experience replay improves the learning efficiency of the AUV.
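A minimal sketch of how such a two-subtask value network could be organised is given below; the shared encoder, the two Q-heads and the meta-selector (including the class and variable names) are illustrative assumptions, not the exact architecture reported by Sun et al. (2020).

```python
import torch
import torch.nn as nn

class HierarchicalDQN(nn.Module):
    """Shared state encoder with two subtask Q-heads (obstacle avoidance and
    target approaching) and a meta head that picks which subtask acts."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.q_avoid = nn.Linear(hidden, n_actions)  # obstacle-avoidance Q-values
        self.q_goal = nn.Linear(hidden, n_actions)   # target-approaching Q-values
        self.meta = nn.Linear(hidden, 2)             # selects the active subtask

    def forward(self, state):
        h = self.encoder(state)
        subtask = torch.argmax(self.meta(h), dim=-1, keepdim=True)  # 0 or 1
        q = torch.where(subtask == 0, self.q_avoid(h), self.q_goal(h))
        return q, subtask

# Example: q, subtask = HierarchicalDQN(state_dim=16, n_actions=5)(torch.randn(1, 16))
```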
Moreover, Cao et al. (2020) proposed a potential field hierarchical reinforcement learning approach to improve the cooperation efficiency of multiple AUVs in a target searching task. In their method, the multi-agent cooperative MAXQ algorithm was used for hierarchical reinforcement learning (HRL) (Cheng et al., 2007; Li et al., 2010; Shen et al., 2006), and a potential field was used to automatically adjust the parameters of the HRL. In simulated experiments, the proposed method was shown to enable multiple AUVs to successfully bypass dynamic and static obstacles and find the nearest target point to each AUV.
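For reference, a conventional artificial potential field of the kind typically combined with such planners can be computed as below; the attractive-plus-repulsive form and the gain values follow a generic textbook formulation and are not the specific field or parameter-adjustment rule used by Cao et al. (2020).

```python
import numpy as np

def artificial_potential(pos, goal, obstacles, k_att=1.0, k_rep=100.0, d0=5.0):
    """Attractive-plus-repulsive potential at `pos` (2D or 3D numpy array).
    Gains k_att, k_rep and influence radius d0 are illustrative values only."""
    u_att = 0.5 * k_att * np.linalg.norm(pos - goal) ** 2
    u_rep = 0.0
    for obs in obstacles:
        d = np.linalg.norm(pos - obs)
        if 0.0 < d < d0:  # repulsion only within the obstacle's influence radius
            u_rep += 0.5 * k_rep * (1.0 / d - 1.0 / d0) ** 2
    return u_att + u_rep

# Example: potential felt by an AUV at (1, 2) heading to goal (10, 10)
# with one obstacle at (3, 2).
u = artificial_potential(np.array([1.0, 2.0]), np.array([10.0, 10.0]),
                         [np.array([3.0, 2.0])])
```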
Compared with other mobile robots, applying reinforcement learning to AUV path planning usually means using the output of the learning algorithm to select actions from the action space and directly control the rudder, elevator and propeller of the AUV. When ocean currents are present in the environment, the reinforcement learning algorithm takes the current as one of the state inputs and relies on continued iterative learning to cope with it. Reinforcement learning has been shown to help AUVs complete target search and navigation in environments with unknown obstacles, though mostly in simulations. However, learning an optimal policy in tasks with a large state space and transferring it to a physical AUV platform remain major challenges.
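As a concrete, hypothetical example of this direct-control formulation, the sketch below builds a state vector that includes the measured current and maps a learned action straight to actuator commands; the field names, ranges and scaling factors are assumptions for exposition, not a published AUV control interface.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class AUVState:
    sonar_ranges: np.ndarray      # obstacle distances from sonar beams (m)
    goal_bearing: float           # relative bearing to the target (rad)
    current_velocity: np.ndarray  # measured ocean current, e.g. (vx, vy) in m/s

    def to_vector(self):
        return np.concatenate([self.sonar_ranges,
                               [self.goal_bearing],
                               self.current_velocity])

def action_to_actuators(action):
    # Map a learned 3D action in [-1, 1]^3 directly to actuator commands.
    rudder_deg = 30.0 * action[0]      # rudder angle
    elevator_deg = 20.0 * action[1]    # elevator angle
    thrust = 0.5 * (action[2] + 1.0)   # propeller thrust fraction in [0, 1]
    return rudder_deg, elevator_deg, thrust
```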
4.6. Deep reinforcement learning
Deep reinforcement learning (DRL) combines the perception of deep learning with the decision making of reinforcement learning. The advantage of DRL is that it can use deep learning (e.g., a deep neural network) to automatically learn low-dimensional state features from high-dimensional states, reducing dimensionality through iterative interaction with the environment. It alleviates the limitations of reinforcement learning caused by large state and action spaces (Arulkumaran et al., 2017). Deep reinforcement learning has opened up a new way to learn from complex, nonlinear, high-dimensional sensory input in unknown environments. DRL has been widely used for obstacle avoidance of unmanned surface and aerial vehicles (Singla et al., 2019; Yan et al., 2019), and more and more researchers have started trying to apply it to AUV path planning.
For example, in a target search task, Cao et al. (2019) applied the asynchronous advantage actor–critic (A3C) method (Mnih et al., 2016) to obstacle avoidance of an AUV. In their method, an A3C network structure was used in which each thread independently interacts with the environment. The learning results of all threads were collected into a global actor–critic pair and combined with a dual-stream Q-network (Simonyan and Zisserman, 2014). The network structure is composed of multiple convolutional layers and long short-term memory (Hochreiter and Schmidhuber, 1997). The input
information is passed through the Otsu method (Gupta et al., 2018), a morphological closing operation with a disk-shaped structuring element, coordinate system transformation and rasterization to remove noise points from the original sonar image.
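A sonar-denoising pipeline of this kind can be sketched with OpenCV as below; the Otsu thresholding and disk-shaped closing follow the description above, while the kernel size and the subsequent coordinate transformation and rasterization details are assumptions rather than the exact parameters used by Cao et al. (2019).

```python
import cv2
import numpy as np

def denoise_sonar(sonar_img, kernel_size=5):
    """Binarise a grayscale sonar image with Otsu's threshold, then apply a
    morphological closing with a disk-shaped structuring element to remove
    isolated noise points (kernel_size is an illustrative choice)."""
    _, binary = cv2.threshold(sonar_img, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    disk = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                     (kernel_size, kernel_size))
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, disk)
    return closed

# The cleaned image would then be transformed into the AUV's coordinate
# frame and rasterized into a grid before being fed to the network.
```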
Their simulated experiments show that an AUV with A3C can effectively avoid obstacles in various environments and complete the target search task efficiently.
In addition, Wu et al. (2019) proposed an end-to-end AUV motion control framework based on the Proximal Policy Optimization algorithm (Schulman et al., 2017), which takes the raw sonar sensory information directly as input and does not need to consider the dynamic characteristics of the AUV. In their experiments, the reward function takes multiple objectives such as waypoint tracking, obstacle avoidance, collision penalty and speed as constraints, making the algorithm more suitable for AUV path planning in dangerous underwater environments full of obstacles. Furthermore, in order to avoid the difficulty and noise of underwater positioning, they proposed a new state encoder and reward-shaping strategy, which enables learning without knowing the position of the AUV. Their detailed comparative experiments show that the AUV can complete the obstacle avoidance task in a 2D environment.
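A multi-term reward of this kind might be composed as in the sketch below; the individual terms mirror the objectives listed above, but the weights and distance measures are illustrative assumptions, not the reward actually used by Wu et al. (2019).

```python
import numpy as np

def shaped_reward(dist_to_waypoint, prev_dist_to_waypoint,
                  min_obstacle_dist, speed, collided,
                  w_track=1.0, w_avoid=0.5, w_speed=0.1,
                  collision_penalty=100.0):
    """Combine waypoint tracking, obstacle avoidance, speed and a collision
    penalty into one scalar reward (weights are illustrative only)."""
    r_track = w_track * (prev_dist_to_waypoint - dist_to_waypoint)  # progress
    r_avoid = -w_avoid * np.exp(-min_obstacle_dist)   # penalise closeness to obstacles
    r_speed = w_speed * speed                         # encourage forward motion
    r_collision = -collision_penalty if collided else 0.0
    return r_track + r_avoid + r_speed + r_collision
```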