Voronoi-Based Multi-Robot Autonomous Exploration in Unknown Environments via Deep Reinforcement Learning

Autonomous exploration is an important application of multi-vehicle systems, in which a team of networked robots is coordinated to explore an unknown environment collaboratively. This technique has attracted significant research interest due to its usefulness in search and rescue, fault detection and monitoring, localization and mapping, etc. In this paper, a novel cooperative exploration strategy is proposed for multiple mobile robots, which reduces the overall task completion time and energy cost compared to conventional methods. To navigate the networked robots efficiently during collaborative tasks, a hierarchical control architecture is designed, containing a high-level decision-making layer and a low-level target-tracking layer. The proposed cooperative exploration approach is developed using dynamic Voronoi partitions, which minimize duplicated exploration by assigning different target locations to individual robots. To deal with sudden obstacles in the unknown environment, an integrated deep reinforcement learning based collision avoidance algorithm is then proposed, which enables the control policy to learn from human demonstration data and thus improves the learning speed and performance. Finally, simulation and experimental results are provided to demonstrate the effectiveness of the proposed scheme.

Compared with a single robot, multi-robot systems provide benefits such as robustness to hardware faults, adaptability to environmental changes, and cost-effectiveness. Therefore, multi-robot systems have many potential applications, such as object transportation [1], security surveillance [2], disaster detection [3], formation coordination [4], [5], vehicle-assisted wireless communication networks [6], autonomous shepherding [7], connected vehicle platoons [8], [9], etc. As an important component of multi-robot coordination, cooperative exploration techniques have been widely used in search and rescue missions and in mapping unknown environments, which motivates the need to develop more efficient and feasible solutions.
In recent years, many advanced control methods and path planning techniques have been investigated for single robots performing exploration tasks. In [10], Frontier Exploration Planning (FEP) and receding horizon Next-Best-View Planning (NBVP) were combined, using FEP as a global exploration planner and receding horizon NBVP for local exploration. A two-stage heuristic information gain-based NBVP algorithm was proposed in [11] for autonomous exploration and reconstruction in 3-D unknown environments. The two-stage planner included a frontier-based boundary coverage planner and a fixed-start open travelling salesman problem solver, so that an optimal path is returned. In [12], a priori knowledge-based dynamic object search strategy was proposed to improve the search efficiency of mobile robots in home environments. An incremental roadmap-based path planning strategy was developed in [13], where the feasible global path was further optimized with the proposed trajectory optimization method, which considered the motion constraints of the robot. In [14], the traditional approach of frontier-based exploration was combined with deep reinforcement learning to allow a robot to autonomously explore unknown cluttered environments. However, in the aforementioned works, autonomous exploration methods were only designed for single-robot cases, which may be viewed as a limitation when exploring large unknown areas. The use of multiple cooperating robots offers several advantages, e.g., reduced mission completion time, increased fault-tolerance of the whole system, and improved localization efficiency [15], which motivates further research on multi-robot cases.
When multiple robots are deployed in an exploration task, the autonomous exploration problem becomes more challenging, as the cooperation between the robots must be considered thoroughly to ensure high levels of efficiency and to avoid conflicts such as repeated trajectories. Some pioneering research works have been conducted in recent years. As an example, in [16], a significant contribution was made in developing a collaborative multi-robot system that built an information network in an unknown environment by deploying information nodes. By using the proposed networking strategy, the power consumption was reduced while preserving global coverage. However, this article did not include hardware experiments to validate the theoretical results. In [17], pioneering work was completed on a decentralized cooperative exploration strategy for a mobile robotic team equipped with range finders, using a sensor-based random graph method. A trade-off between information gain and navigation cost was considered in the local planner. In another study [18], a Gaussian mixture model for global mapping was proposed to model complex environment geometries. A small memory footprint was maintained, which enables distributed operation with a low volume of communication. A local occupancy grid for use in planning was then generated from the Gaussian mixture model using Monte Carlo ray tracing. In [19], an autonomous exploration and mapping strategy was developed for heterogeneous multi-robot systems including both quadrotors and wheeled mobile robots. Even though exploring a static environment using a team of robots has been investigated extensively [20], [21], dealing with unexpected obstacles, such as human movements and dynamic environments, remains an open question.
With improvements in computational capability and access to large training databases, deep learning has shown great potential for solving autonomous vehicle navigation problems. In contrast to artificial potential field based collision-free path planning methods [22]-[24], which require comprehensive knowledge about the environment (e.g., the sizes and positions of the obstacles) and the robotic platforms (e.g., an accurate mathematical model of the robots), deep reinforcement learning techniques have the potential to achieve safe navigation with less advance information about the work area. Besides, since the neural network is trained end-to-end and maps the vehicle sensor data directly into action commands, the analysis complexity of obtaining the desired trajectory is largely reduced across different scenarios.
Some recent progress in using deep learning techniques to solve autonomous navigation problems is introduced as follows. In [25], an efficient method to solve the inverse reinforcement learning problem based on sparse Gaussian process prediction was proposed. The robots were able to mimic the expert behaviour for avoiding collisions, and the work demonstrated the potential for application to real robots. In [26], a robust collision avoidance algorithm was developed for autonomous vehicles, which accelerates the learning of optimal policies and efficiently utilizes the remaining resources. In [27], a multi-stage and multi-scenario training framework for training the collision avoidance policy was proposed. The collision avoidance capability of multiple agents was demonstrated in a simulated environment. A reinforcement learning based anti-jamming relay scheme for underwater sensor networks was proposed in [28], which optimized the relay mobility and power allocation without being aware of the underwater channel model or the jamming model. The relay performance of the proposed scheme was also validated by real-world experiments in a water tank. In [29], a successor-feature-based deep reinforcement learning algorithm was developed that improved the training speed and transferred learned experience to new scenarios. A novel deep reinforcement learning algorithm was designed in [30] to integrate long short-term memory and a local-map critic, enabling the robot to navigate with a limited field of view whilst outperforming methods that used a wider field of view. In [31], a long-term path planning algorithm using deep reinforcement learning instead of a near-field collision avoidance algorithm [32] was proposed to navigate the robot to visit unexplored environments with a greedy strategy. An end-to-end deep reinforcement learning based collision avoidance algorithm was proposed in [33].
The laser data and the relative target position were used as inputs to a neural network, and the policy generated heading and speed commands. An asynchronous version of Deep Deterministic Policy Gradient (DDPG) was proposed to improve the Q-value estimation. The efficiency was validated in Gazebo simulation and in a real-world environment. However, the performance of conventional deep learning methods relies heavily on the training data, and human experience is not well utilized during this process. How to improve the output performance and accelerate the training process by exploiting human experience and guidance remains an open problem that needs further effort.
Motivated by the consistent progress and technological advancements in applications of cooperative exploration, and especially by the increasing need for novel coordination techniques for handling multiple robots, in this paper we aim to design an efficient autonomous exploration strategy for a decentralized collaborative multi-robot team while avoiding suddenly observed obstacles. The contributions of the paper can be summarized as follows:
- A Voronoi-based exploration strategy is developed to coordinate a multi-robot team effectively in exploring an unknown area. Each robot is assigned a different target location based on dynamic Voronoi partitions to avoid duplicated exploration.
- A collision avoidance algorithm based on deep reinforcement learning is proposed to navigate each robot to its target. The proposed method enables the control policy to learn from human demonstration data and outperforms conventional methods in terms of training speed and final performance.
- The feasibility of the proposed cooperative exploration strategy is validated by real-world experiments using wheeled mobile robots.

The rest of the paper proceeds as follows. Section II covers the technical background. In Section III, a hierarchical control architecture for networked explorers is proposed; a Voronoi-based exploration algorithm and a deep reinforcement learning based collision avoidance approach are then provided to coordinate the robots efficiently while avoiding sudden obstacles. Simulation case studies and experimental validation using real mobile robots are given in Section IV to highlight the feasibility and efficiency of the proposed strategy. Section V concludes the paper.

A. Communication Networks
Consider a weighted and undirected information graph I = (V, E, A) with a non-empty set of nodes V = {1, 2, ..., N}, a set of edges E ⊂ V × V, and the associated adjacency matrix A = [a_ij] ∈ R^{N×N}. An edge rooted at the i-th node and ending at the j-th node is denoted by (i, j), which means information can flow from node i to node j. a_ij is the weight of edge (j, i), representing the distance between these two nodes, and a_ij = 0 if (j, i) ∉ E. Node j is called a neighbor of node i if (j, i) ∈ E. An undirected graph I is connected if there is a path between every pair of distinct nodes. The shortest-path distance between two nodes in a graph is the sum of the weights of all edges along a shortest path connecting them.
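The shortest-path distance defined above can be computed per robot with Dijkstra's algorithm. The sketch below uses a hypothetical adjacency-dict representation of the weighted graph (names are illustrative) and assumes positive edge weights:

```python
import heapq

def shortest_path_distance(adjacency, source, target):
    """Dijkstra's algorithm on a weighted, undirected graph.

    `adjacency` maps each node to a dict {neighbor: edge_weight},
    i.e., a_ij > 0 iff (j, i) is an edge. Returns the sum of edge
    weights along a shortest path, or float('inf') if the two nodes
    are not connected.
    """
    dist = {source: 0.0}
    frontier = [(0.0, source)]
    visited = set()
    while frontier:
        d, node = heapq.heappop(frontier)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            return d
        for nbr, w in adjacency.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nbr, float('inf')):
                dist[nbr] = nd
                heapq.heappush(frontier, (nd, nbr))
    return float('inf')

# A small connected graph: 1-2-3 in a line, plus a longer direct edge 1-3.
graph = {
    1: {2: 1.0, 3: 5.0},
    2: {1: 1.0, 3: 1.0},
    3: {1: 5.0, 2: 1.0},
}
```

For instance, the distance from node 1 to node 3 is 2.0 (via node 2), not the direct edge weight 5.0.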

B. Voronoi Partitions
We partition the area between the available mobile robotic platforms using Voronoi partitions in which the centroid of each Voronoi cell is taken to be the position of a single mobile robot. Thus, a certain region within this area (namely the corresponding Voronoi cell) is allocated to each robot for exploring. This is performed on an iterative basis, so the Voronoi partitions are dynamic in nature.
Using Voronoi partitions, the area to be mapped can be broken up dynamically among the robot team members based on their current locations. Also by construction, Voronoi partitioning can be implemented in a decentralized fashion via inter-robot communications.
Let the convex polygon Q ⊂ R² be an open, obstacle-free space to be explored. Suppose R_i, ∀i ∈ {1, 2, ..., N}, are networked robots, N_{R_i} is the set of robots that are neighbors of R_i, and p_i represents the position of robot R_i. The Voronoi cell Vor(R_i) is defined by

Vor(R_i) = { q ∈ Q : ||q − p_i|| ≤ ||q − p_j||, ∀ j ≠ i },   (1)

which means the Voronoi cell of Q generated by R_i is the set of all points in Q that are closer to R_i than to any other robot.
Notice that to define the boundaries of its Voronoi cell, robot i at position p_i only needs to know the boundary of Q and the positions of those neighbors p_j whose Voronoi cells share a common edge with its own. Using dynamic Voronoi partitions, each robot can compute its partition with knowledge only of its neighbors' locations; thus, Voronoi partitioning facilitates decentralized control designs. Another advantage of using Voronoi partitions arises in the case of a robot or sensor failure. Because the Voronoi partitions are constructed dynamically, the team can adjust its Voronoi partition configuration to take into account the new neighbor set, excluding the failed robot. This procedure ensures that all regions in the area will be covered by an operational robot.
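As a minimal sketch of the decentralized computation (function and variable names are illustrative), a robot can test whether a point belongs to its own Voronoi cell using only its neighbors' positions:

```python
import math

def in_voronoi_cell(q, p_i, neighbor_positions):
    """Decentralized Voronoi membership test per Eq. (1): a point q
    lies in robot i's Voronoi cell iff it is at least as close to p_i
    as to every neighbor position p_j."""
    d_i = math.dist(q, p_i)
    return all(d_i <= math.dist(q, p_j) for p_j in neighbor_positions)

# Three robots on the plane (illustrative positions).
p1, p2, p3 = (0.0, 0.0), (4.0, 0.0), (0.0, 4.0)
```

If a neighbor fails, re-running the same test with the reduced neighbor list automatically enlarges the surviving robots' cells, which is the failure-recovery behavior described above.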

C. Collision Avoidance Problem Formulation
The collision avoidance problem for autonomous agents is defined in the context of each agent moving on the Euclidean plane under the assumption that obstacles may appear. The problem can be formulated as finding the translation function

v_t = f(l_t, p_t, v_{t−1}),

where l_t is the laser range data at time step t, p_t stands for the relative position of the target, and v_{t−1} denotes the velocity command at the previous time step. The model computes the velocity command v_t for the next time step. It is assumed that each agent only has partial observations within the sensing range r_s of the rangefinder. This assumption makes our method more practical and robust in real-world environments.
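A minimal sketch of this interface is shown below; the type and function names are illustrative, and the trivial rule-based policy is only a stand-in for the learned mapping f:

```python
from typing import NamedTuple, Sequence, Tuple

class Observation(NamedTuple):
    """Partial observation available to one agent (illustrative names)."""
    laser: Sequence[float]               # l_t: range readings, capped at r_s
    target: Tuple[float, float]          # p_t: relative target position
    prev_velocity: Tuple[float, float]   # v_{t-1}: (linear, angular)

def dummy_policy(obs: Observation) -> Tuple[float, float]:
    """Stand-in for the learned translation function v_t = f(l_t, p_t, v_{t-1}).
    Here it simply stops and turns when any reading is closer than 0.3 m."""
    if min(obs.laser) < 0.3:
        return (0.0, 0.5)   # stop and turn away
    return (0.25, 0.0)      # cruise straight ahead
```

The learned policy in Section III replaces `dummy_policy` with a neural network that consumes the same observation tuple.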

A. Hierarchical Control Architecture
In this subsection, a two-layer control architecture for the networked robots is presented, which includes a high-level decision-making layer and a low-level target-tracking layer, as shown in Fig. 1. In the first layer, the desired next frontier point is selected based on the Voronoi partition and the synchronized map. This information is sent to the second layer for tracking, where a neural network trained by deep reinforcement learning drives the robot to the desired position while avoiding potential obstacles.
The cooperative exploration method is proposed under the following assumptions:
1) The robots are equipped with an omnidirectional sensory system that provides a description of the free space surrounding the robot and detects the boundary of an obstacle within the maximum sensing range r_s.

B. Coordination Algorithm for Explorer Robots
Frontier-based techniques have been widely used in multi-robot exploration missions. Motivated by [16], [17], [34], the proposed cooperative exploration strategy for the first layer is introduced in this section. In order to mark the explored area, every robot in the multi-vehicle system deploys information nodes while exploring the environment. These deployed information nodes form an information network, which allows the robots to share information in a decentralized manner. Note that the information nodes can be real sensor devices or virtual targets in the synchronized map of each robot. Once an information node is deployed at a certain position, it generates new frontier points on the border between a sensed area (e.g., the blue area in Fig. 2) and an open area covered by no sensor (e.g., the white area in Fig. 2). In the case where no frontier point can be found, it can be concluded that the whole environment has been fully explored.
At the beginning, robot R_i deploys a new information node and generates some new frontier points. The utility function for frontier point k assigned to robot R_i is defined as

Ω_ik = λ d_ik + (1 − λ) φ_ik,

where λ is a scalar satisfying 0 ≤ λ ≤ 1, d_ik represents the distance between R_i and frontier point k, and φ_ik denotes the distance between frontier point k and the initial position of R_i (i.e., the position of the first information node). Robot R_i iteratively searches for the frontier point with minimum Ω_ik. Note that when λ = 0, the robots perform a breadth-first exploration; that is, they explore all of the neighboring frontier points at the present depth prior to moving on to the frontier points at the next depth level. On the contrary, if λ = 1, the robots adopt depth-first exploration, in which they explore frontier points as far as possible before being forced to backtrack and expand other frontier points. Based on the environment and task at hand, the value of λ should be fixed by the engineers to achieve the best exploration and mapping performance. Whenever robot R_i deploys a new information node, the information graph I is updated and each robot broadcasts the updated I so that every robot sharing the network can update I. In order to reach the selected frontier point f_{R_i}, a graph search is performed: R_i first locates the information node that generated the selected frontier point, then follows the shortest-path distance in the graph to reach the desired information node. We define I_i^t as the subgraph of I at time t generated by the set of nodes that can be observed by R_i. According to the definition of I_i^t, this path is a collision-free path for R_i. Besides, the length of each line segment along this path is bounded by r_s. Once R_i reaches f_{R_i}, it deploys another new information node and iterates this procedure until there is no frontier point left in the map. An example of a single-robot exploration task using the proposed strategy is shown in Fig. 2.
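A minimal sketch of the frontier selection step is given below. The weighted-sum form of Ω_ik is an assumption reconstructed from the text (it reproduces the described λ = 0 breadth-first and λ = 1 depth-first limiting behaviors); the variable names are illustrative:

```python
def select_frontier(frontiers, lam):
    """Pick the frontier point with minimum utility
    Omega_ik = lam * d_ik + (1 - lam) * phi_ik, with lam in [0, 1].
    `frontiers` maps a frontier id to the pair (d_ik, phi_ik)."""
    assert 0.0 <= lam <= 1.0
    return min(frontiers,
               key=lambda k: lam * frontiers[k][0] + (1 - lam) * frontiers[k][1])

# d_ik: distance from the robot; phi_ik: distance from the robot's start.
frontiers = {"near_robot": (1.0, 6.0), "near_start": (5.0, 1.0)}
```

With λ = 1 the robot greedily chains to the frontier nearest its current position (depth-first behavior); with λ = 0 it prefers frontiers closest to its starting node (breadth-first behavior).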
When multiple explorers participate in the task, it can be observed that, under the aforementioned strategy, the robots may move in the same direction and explore duplicated areas, causing low efficiency and longer mission accomplishment times. In order to avoid such scenarios, a Voronoi-based method is introduced. For any robot i, a Voronoi partition is generated based on the positions of its neighbor robots R_j, and only those frontier points within its own Voronoi partition are considered for the next movement. This effectively removes undesirable frontier points from selection by the explorers. An example of the dynamic Voronoi partition of three robots in the autonomous exploration task is illustrated in Fig. 3.
Towards this end, we consolidate the aforementioned multi-robot cooperative exploration techniques into Algorithm 1, which provides a systematic set of guidelines for robotic practitioners to implement.

C. Deep Reinforcement Learning Setup
A mapless collision avoidance algorithm is proposed for navigating the mobile robots through the waypoints generated by the cooperative exploration algorithm, summarized in Algorithm 1.

Algorithm 1: Voronoi-Based Cooperative Exploration.
1: repeat
2: for each robot R_i do
3: if robot R_i detects no information node inside the circle with radius r_s then
4: R_i drops an information node at the current position;
5: end if
6: if f_{R_i} is empty then
7: R_i searches for the next frontier point inside its own Voronoi partition with minimum Ω_ik;
8: The selected frontier point is set as f_{R_i};
9: R_i starts moving along the shortest path in I_i^t to reach f_{R_i};
10: end if
11: if R_i meets f_{R_i} then
12: R_i drops an information node at that position;
13: else
14: R_i keeps moving to reach f_{R_i};
15: end if
16: end for
17: until there is no frontier point for any information node;

The mapless navigation algorithm allows the mobile robots to navigate through the waypoints without accurate information about static obstacles in advance, which improves the safety of the mobile robots significantly. As some obstacles are expected to appear while a mobile robot is navigating through the waypoints, a collision avoidance algorithm is required to generate reactive maneuvering.
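The per-robot branch logic of Algorithm 1 can be sketched as a single-step function, shown below. This is a simplified sketch: the `robot` dictionary, its keys, and the string return labels are illustrative, and `frontiers_in_cell` is assumed to be pre-filtered to the robot's own Voronoi partition:

```python
import math

def explore_step(robot, frontiers_in_cell, r_s=1.3):
    """One iteration of the per-robot loop in Algorithm 1 (sketch).
    `frontiers_in_cell` holds (frontier_position, Omega_ik) pairs
    inside this robot's Voronoi cell."""
    pos = robot["position"]
    # Lines 3-5: drop an information node if none is within range r_s.
    if all(math.dist(pos, n) > r_s for n in robot["nodes"]):
        robot["nodes"].append(pos)
        return "drop_node"
    # Lines 6-10: select the frontier with minimum Omega_ik if idle.
    if robot["goal"] is None:
        if not frontiers_in_cell:
            return "done"  # no frontier left: exploration is complete
        robot["goal"] = min(frontiers_in_cell, key=lambda f: f[1])[0]
        return "select_frontier"
    # Lines 11-15: drop a node on arrival, otherwise keep moving.
    if math.dist(pos, robot["goal"]) < 1e-6:
        robot["nodes"].append(pos)
        robot["goal"] = None
        return "drop_node"
    return "move"
```

A caller would invoke `explore_step` once per control cycle, moving the robot toward `robot["goal"]` along the collision-free graph path between calls.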

1) Observation and Action Space:
The laser rangefinder state l_t, the previous velocity state v_{t−1}, and the relative target position state p_t at time t are concatenated into the observation state vector s_t. The relative target state p_t is represented in polar coordinates, i.e., the relative distance and angle between the robot and the target. The continuous action generated by the collision avoidance policy comprises a linear velocity v_l and an angular velocity v_a.
2) Reward Space: Each robot is commanded to navigate through the waypoints generated by the path planning algorithm while avoiding potential collisions. The reward function is defined as

r = r_d + r_cl + r_av + r_lv,

where r represents the total reward, r_d the distance reward, r_cl the safety clearance reward, r_av the angular velocity reward, and r_lv the linear velocity reward. The distance reward is calculated by

r_d = r_a, if d_p < d_p_min;  r_d = Δd_p, otherwise,

where r_a is the arrival reward granted when the distance d_p between the robot and the target falls below the threshold d_p_min; otherwise, r_d is the distance Δd_p the robot moved towards the target during the last time step. The safety clearance reward r_cl is calculated by

r_cl = r_cpo, if d_o < d_o_max;  r_cl = r_cp, if d_o_max ≤ d_o < 2 d_o_max;  r_cl = 0, otherwise,

where r_cp denotes the negative reward (punishment) applied when the distance d_o between the robot and the nearest obstacle lies between the clearance distance threshold d_o_max and 2 d_o_max; when d_o is less than d_o_max, the negative reward r_cpo is applied; otherwise, the robot is not punished. The angular velocity reward r_av and linear velocity reward r_lv are given by

r_av = r_ap, if v_a > v_a_max;  r_av = 0, otherwise,
r_lv = r_lp, if v_l < v_l_min;  r_lv = 0, otherwise,

where v_a_max denotes the angular velocity threshold and v_l_min the linear velocity threshold. If the angular velocity v_a exceeds v_a_max or the linear velocity v_l falls below v_l_min, the robot receives the punishment value r_ap or r_lp, respectively.

3) Network Structure: As shown in Fig. 4, the input to the neural network is the concatenation of the rangefinder data (24-dimensional vector), the relative target position (2-dimensional vector) and the velocity data (2-dimensional vector). The input layer is connected to three dense layers, each with 512 nodes. The actor network generates the linear velocity command through a sigmoid function and the angular velocity command through a hyperbolic tangent function. These two velocity commands are concatenated into the action vector, which is used as an input to the critic network shown in Fig. 5.
The state data of the robot (28-dimensional vector) is also used as input to the critic network and is processed by three dense layers, each with 512 nodes. The action input is merged into the second layer, and the Q-value is finally generated through a linear activation function.
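The piecewise reward terms described in the Reward Space subsection can be sketched as follows. The threshold defaults match the Turtlebot3 values reported in Section IV, but the reward magnitudes (r_a, r_cp, r_cpo, r_ap, r_lp) are illustrative placeholders, not the paper's values:

```python
def reward(d_p, delta_d_p, d_o, v_a, v_l,
           d_p_min=0.2, d_o_max=0.25, v_a_max=0.8, v_l_min=0.052,
           r_a=10.0, r_cp=-0.1, r_cpo=-0.5, r_ap=-0.05, r_lp=-0.05):
    """Sum reward r = r_d + r_cl + r_av + r_lv (sketch)."""
    # Distance term: arrival bonus, else progress toward the target.
    r_d = r_a if d_p < d_p_min else delta_d_p
    # Clearance term: punish proximity to the nearest obstacle.
    if d_o < d_o_max:
        r_cl = r_cpo
    elif d_o < 2 * d_o_max:
        r_cl = r_cp
    else:
        r_cl = 0.0
    # Smoothness terms: punish sharp turns and crawling speeds.
    r_av = r_ap if v_a > v_a_max else 0.0
    r_lv = r_lp if v_l < v_l_min else 0.0
    return r_d + r_cl + r_av + r_lv
```

For example, a robot far from any obstacle that just moved 0.05 m closer to a distant target collects only the progress term.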

4) Integration of DDPG and PER: The Deep Deterministic Policy Gradient (DDPG) algorithm is an actor-critic, model-free algorithm that can operate over continuous action spaces; its performance has been demonstrated on 20 simulated physics tasks [35]. However, random action exploration makes the learning process inefficient for the mobile robot collision avoidance application. In order to improve the performance and learning speed of DDPG, we integrate DDPG with the Prioritized Experience Replay (PER) algorithm using human demonstration data. In this research, the human demonstration data R_demo are recorded before training the networks, as shown in Fig. 6. By observing the robot behaviour in the Gazebo environment, we manually control the robot to avoid obstacles and move towards the goal. The reward value r_t is calculated by applying the reward function to the state data s_t and the action data a_t. Instead of sampling data uniformly, we integrate DDPG and PER to sample important data more frequently by calculating the priority of each state transition. The integration of DDPG and PER is shown in Algorithm 2.
In this research, the implementation of PER is given as follows. In Lines 6 and 7 of Algorithm 2, the random processes for the angular velocity v_a and the linear velocity v_l are set to draw random values from the ranges (−ϕ, ϕ) and (0, ψ), respectively, where ϕ and ψ are positive constants depending on the hardware constraints of the robotic platforms. In Line 13, p_i describes the priority of a transition, δ_i denotes the Temporal Difference (TD) error, the second term λ|∇_a Q(s_i, a_i|θ^Q)|² denotes the actor loss, τ_p represents a small positive sampling probability ensuring that every transition can still be sampled, and τ_D is used to increase the sampling frequency of the demonstration data. Note that τ_p and τ_D are user-defined constants chosen from empirical experience. In Line 14, P(i) defines the sampling probability of the i-th transition. In Line 18, the sampling weight ω_i of each transition for updating the networks is calculated, indicating the importance of each transition; B stands for the total batch size and β is a constant for adjusting the sampling weight. Correspondingly, the critic network and the actor network are updated using the equations shown in Lines 19 and 20.

Algorithm 2: Integration of DDPG and PER.
1: Randomly initialize the critic network Q(s, a|θ^Q) and the actor network μ(s|θ^μ) with weights θ^Q and θ^μ
2: Initialize the target networks Q′ and μ′ with weights θ^{Q′} ← θ^Q, θ^{μ′} ← θ^μ
3: Initialize the replay buffer R_p with priorities, using the demonstration data R_demo and a default priority value
4: for episode = 1, M do
5: Initialize a random process N for action exploration:
6: draw the angular velocity exploration noise from (−ϕ, ϕ)
7: draw the linear velocity exploration noise from (0, ψ)
8: Receive the initial observation state s_1
9: for t = 1, T do
10: Select action a_t = μ(s_t) + N_t according to the current policy and exploration noise
11: Execute action a_t, calculate reward r_t and observe the new state s_{t+1}
12: Store the transition (s_t, a_t, r_t, s_{t+1}) in R_p
13: Calculate the sampling priority of each transition: p_i = δ_i² + λ|∇_a Q(s_i, a_i|θ^Q)|² + τ_p + τ_D
14: Compute the sampling probability P(i) = p_i^α / Σ_k p_k^α
15: Sample a minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R_p according to the sampling priority P(i)
16: Set y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1}|θ^{μ′})|θ^{Q′})
17: Set weighted updates to the networks:
18: ω_i = (1/(B · P(i)))^β
19: Update the critic by minimizing the weighted loss L = (1/N) Σ_i ω_i (y_i − Q(s_i, a_i|θ^Q))²
20: Update the actor policy using the sampled policy gradient
21: Update the target networks
22: end for
25: end for
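The PER quantities referenced in Lines 13, 14 and 18 can be sketched as follows. The τ_p and τ_D defaults are the values used in Section IV; λ, α and β are illustrative, and squaring the TD error in the priority is one common reading of the text:

```python
def priorities(td_errors, actor_grad_norms, is_demo,
               lam=0.5, tau_p=0.1, tau_d=0.4):
    """Line 13 (sketch): p_i = delta_i^2 + lam * |grad_a Q|^2 + tau_p,
    plus tau_d for transitions drawn from the human demonstration data."""
    return [d * d + lam * g * g + tau_p + (tau_d if demo else 0.0)
            for d, g, demo in zip(td_errors, actor_grad_norms, is_demo)]

def sampling_probabilities(p, alpha=0.6):
    """Line 14: P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [x ** alpha for x in p]
    total = sum(scaled)
    return [x / total for x in scaled]

def importance_weights(P, batch_size, beta=0.4):
    """Line 18: omega_i = (1 / (B * P(i)))^beta, down-weighting
    transitions that are sampled more often than uniform."""
    return [(1.0 / (batch_size * Pi)) ** beta for Pi in P]
```

High-TD-error transitions receive larger P(i) and are replayed more often, while their gradient contributions are scaled down by ω_i to correct the sampling bias; demonstration transitions keep an elevated floor priority through τ_D.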
In terms of the computational complexity of the proposed collision avoidance algorithm, the policy is trained end-to-end and maps the discrete lidar information and relative goal positions directly into action commands, instead of using the whole environmental map. As shown in Fig. 4, the computational complexity of the actor network can be denoted by O(S_ain F_a1 F_a2 F_a3 S_aout), where S_ain and S_aout denote the dimensions of the input and output layers of the actor network, respectively, and F_a1, F_a2 and F_a3 represent the dimensions of its three fully connected layers. As shown in Fig. 5, the computational complexity of the critic network can be represented by O((S_cin F_c1 + S_aout) F_c2 F_c3 S_cout), where S_cin and S_cout represent the dimensions of the input and output layers of the critic network, respectively, and F_c1, F_c2 and F_c3 denote the dimensions of its three fully connected layers. Considering that the collision avoidance algorithm is deployed on each mobile robot independently, the total computational complexity can be described by O(NM), where N is the number of mobile robots and M is a function of the dimensions of the input states, output states and fully connected layers:

M = S_ain F_a1 F_a2 F_a3 S_aout + (S_cin F_c1 + S_aout) F_c2 F_c3 S_cout.

Since the dimensions of the input layer, the output layer and each fully connected layer are pre-defined and fixed, M can be considered a constant. Therefore, for the multi-robot scheme, the proposed collision avoidance algorithm ensures that the computational time increases linearly with the number of mobile robots.
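Taking the complexity expressions above at face value, M can be evaluated once from the fixed layer dimensions, making the total cost linear in the number of robots N (the product-form cost functions below simply transcribe the text's O-expressions):

```python
def actor_cost(s_in, f1, f2, f3, s_out):
    """Product form O(S_ain * F_a1 * F_a2 * F_a3 * S_aout) from the text."""
    return s_in * f1 * f2 * f3 * s_out

def critic_cost(s_in, f1, a_dim, f2, f3, s_out):
    """Product form O((S_cin * F_c1 + S_aout) * F_c2 * F_c3 * S_cout)."""
    return (s_in * f1 + a_dim) * f2 * f3 * s_out

# Dimensions used in this paper: 28-d state input, three 512-node
# hidden layers, 2-d action output, and a scalar Q-value.
M = actor_cost(28, 512, 512, 512, 2) + critic_cost(28, 512, 2, 512, 512, 1)

def total_cost(n_robots):
    """O(N * M): the cost grows linearly with the number of robots."""
    return n_robots * M
```

Because M depends only on the pre-defined network dimensions, doubling the robot team exactly doubles `total_cost`.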

A. Simulation Results
In this subsection, the performance of the proposed Voronoi-based cooperative exploration strategy is tested with different numbers of robots. The goal is to use mobile robots to explore and map an unknown environment containing several obstacles. The simulation results demonstrate the effectiveness and feasibility of the proposed scheme and also show its efficiency when coordinating multiple robots. Firstly, we used Gazebo [36], an open-source robotics simulator, to train the collision avoidance policy. The proposed algorithm is compared against DDPG in terms of training speed and final performance. Then, the simulation results of three different scenarios (with two, three and four robots, respectively) are provided to verify the feasibility of the proposed Voronoi-based cooperative exploration techniques. Finally, a comparison with a conventional exploration approach over multiple trials is provided to show the efficiency of the proposed algorithm.

1) Collision Avoidance Policy Training and Evaluation:
The Turtlebot3 Waffle Pi model is used in both the Gazebo simulation and the real-world experiment. In the Gazebo environment, as shown in Fig. 7(a), cubes, cylinders, spheres and cuboids are used as obstacles, and the red sphere represents the target location of the robot. The robot is commanded to reach the target position while avoiding the obstacles. Once the robot arrives at the target or collides with a simulated obstacle, it is reset to the origin position and the red sphere is placed randomly in the area outside the obstacles. 3000 steps of demonstration data were sampled for training. In terms of the reward function, the threshold values d_p_min (0.2 m), d_o_max (0.25 m), v_a_max (0.8 rad/s) and v_l_min (0.052 m/s) are set according to the geometry and capability of the Turtlebot3, and can be reconfigured for other robotic platforms. We set τ_p = 0.1 and τ_D = 0.4 in Algorithm 2 based on empirical experience. The Q-value and reward comparisons between DDPG and the proposed algorithm during training are shown in Fig. 8 and Fig. 9, respectively. In Fig. 8, the Q-value of the proposed method reaches 67 in 8900 steps while the original DDPG needs 63310 steps, showing that the proposed algorithm needs only 14% of the steps required by DDPG. As shown in Fig. 9, the reward of the proposed method increases faster than that of DDPG at the beginning, and the average reward of the proposed method between 20000 and 100000 steps is 6.0782, while that of DDPG is just 2.9883. The fluctuation range of the proposed method is also smaller than that of DDPG, indicating that the proposed method is more robust.
To evaluate the performance of the trained models in an unknown environment, a new Gazebo environment was used, as shown in Fig. 7(b). In the new environment, the locations of the obstacles are rearranged and a long wall is added, making the environment more challenging than the training environment. We used 20 random missions to evaluate the performance of DDPG and the proposed algorithm, comparing the model trained for only 10000 steps with the proposed method against the model trained for 50000 steps with DDPG. It was found that the proposed method enabled the robots to complete all 20 missions successfully and even to find targets behind the wall in three missions. In comparison, the DDPG method completed only 15 of the 20 random missions, and in 3 of those 15 successful missions the mobile robot still made unnecessary turns.
In terms of training time, collecting and training on 10000 steps of data with the proposed algorithm took only 40 minutes, whereas it took 2 hours and 40 minutes to collect and train on 50000 steps of data using DDPG. In terms of trajectory quality, the DDPG method makes the robot travel longer distances than the proposed method and also generates unnecessary turns. Therefore, the proposed method enables the mobile robot to learn faster, requiring only 14% of the steps and 25% of the training time needed by DDPG, while still achieving better performance in terms of safety and trajectory quality.
2) Performance of the Proposed Cooperative Exploration Strategy: As shown in Fig. 10, the size of the arena is set to 20 m × 10 m. There are three obstacles in this environment, indicated by the brown, blue, and gray areas. To validate the effectiveness of the proposed strategy, three scenarios (using 2, 3 and 4 robots) are considered for comparison. The sensing range of each robot is set to r_s = 1.3 m and the communication range to r_c = 5 m. The tuning parameter λ for each robot is set to 0.8 in all three cases. Note that different choices of λ may affect the total exploration time, depending on the structure of the environment, but the exploration mission is always completed within finite time regardless of the value of λ.
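The role of λ in selecting the next frontier point can be sketched as follows. The paper states only that the utility function combines path cost and target distance and that λ tunes depth-first versus breadth-first behaviour; the linear combination and the helper names below are assumptions for illustration.

```python
import math

def next_frontier(robot_pos, frontiers, path_cost, lam=0.8):
    """Pick the next frontier point by a lambda-weighted utility that
    trades off path cost against straight-line target distance. The
    exact combination is an assumption; larger lam weights path cost
    more heavily, biasing the robot toward nearby reachable frontiers."""
    def utility(f):
        d = math.dist(robot_pos, f)                       # straight-line distance
        return lam * path_cost(robot_pos, f) + (1.0 - lam) * d
    return min(frontiers, key=utility)

# Usage with a trivial path-cost model (Manhattan distance stands in
# for a planner-computed path length):
manhattan = lambda p, q: abs(p[0] - q[0]) + abs(p[1] - q[1])
goal = next_frontier((0.0, 0.0), [(1.0, 1.0), (4.0, 0.0)], manhattan, lam=0.8)
```

Setting λ near 1 makes the robot finish nearby frontiers first (depth-first flavour), while λ near 0 lets distant but direct targets win (breadth-first flavour).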
In the first case, two explorers are used for this autonomous exploration task. Fig. 10(a) depicts the trajectories of the robots and the positions of the obstacles. At the beginning of the task (i.e., t = 0 s), both robots are placed at the top of the arena. Following the proposed algorithm, they deploy the first information node at their initial positions and thus generate several frontier points at the edge of the sensing area. The dotted line represents the Voronoi partition generated from their current locations. As can be seen from the second and third sub-figures in Fig. 10(a), due to the Voronoi partition, one robot mainly explores the left part of the arena while the other moves toward the right side. The information nodes deployed by the two robots are marked by red and blue stars, respectively. Finally, the cooperative exploration task is accomplished within 873 s and no frontier point remains in the environment.
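The partitioning effect described above reduces to assigning each frontier point to the robot whose position is closest, i.e., grouping frontiers by the Voronoi cell they fall into. A minimal sketch of that assignment (the paper recomputes it continuously as the robots move; the coordinates below are hypothetical):

```python
import math

def voronoi_assign(frontiers, robot_positions):
    """Group frontier points by the Voronoi cell they fall into:
    each frontier is assigned to the nearest robot, so no two robots
    pursue frontiers in each other's region."""
    cells = {i: [] for i in range(len(robot_positions))}
    for f in frontiers:
        owner = min(range(len(robot_positions)),
                    key=lambda i: math.dist(f, robot_positions[i]))
        cells[owner].append(f)
    return cells

# Two explorers starting at the top of a 20 m x 10 m arena (illustrative values):
robots = [(2.0, 9.0), (18.0, 9.0)]
frontiers = [(1.0, 8.0), (19.0, 8.0), (9.0, 5.0)]
cells = voronoi_assign(frontiers, robots)
```

Because the cells partition the free space, each robot can run its frontier-selection rule purely on its own cell, which is what keeps the explored regions from overlapping.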
In the second case, three robots are deployed to perform the mission. As in the first case, they are placed at the top of the arena at the beginning, as shown in the first sub-figure in Fig. 10(b). Due to the communication between the robots, the arena is divided into three partitions and each robot explores only its corresponding area. Since the robots are moving, the Voronoi partition of the whole environment is also time-varying, which keeps all the collaborating robots working efficiently. The information nodes deployed by each robot are represented by red, blue, and green stars, respectively. From the last sub-figure of Fig. 10(b), the mission completion time is 598 s, which is significantly less than in the two-robot case and shows the efficiency gained by increasing the number of working robots.
In the third case, we increase the number of robots to four for the same autonomous exploration and mapping mission. The simulation results are illustrated in Fig. 10(c) and the information nodes deployed by each robot are represented by red, blue, green, and pink stars, respectively. The mission is completed within 539 s, which is slightly shorter than when using three robots.
3) Comparing the Proposed Cooperative Exploration Method With the Approach in [16]: In this subsection, we compare the performance of the proposed exploration strategy with the conventional approach presented in [16]. Note that the exploration algorithm in [16] can only prevent different robots from selecting the same frontier point as their next goal, which cannot guarantee efficiency when coordinating multiple robots in a large unknown environment. In contrast, the proposed Voronoi-based strategy ensures that each robot explores its own region without conflicting with others, which minimizes the task completion time and saves resources and energy. Moreover, sudden obstacles can also be handled in the proposed scheme via deep reinforcement learning, which makes the multi-robot system more robust when performing real-world missions.
In order to verify the performance of both strategies, 50 trials were run for each case shown in the previous subsection, corresponding to 50 different sets of initial positions of the explorers. Fig. 11 illustrates the task completion times of the robots under the different scenarios. It can be seen that the proposed control scheme leads to faster exploration in all cases, which is a significant improvement over conventional methods. Note that, due to the limited size of the arena and the density of the robots, the reduction in mission completion time becomes insignificant when the number of robots increases above five. Hence, a trade-off between the exploration time and the available hardware resources should be determined based on the specific environment and the requirements of the task.

B. Experimental Validation
In the real-robot experiment, we used three Turtlebot3 Waffle Pi mobile robots (as shown in Fig. 12) to test the feasibility of the proposed autonomous exploration strategy and learning-based collision avoidance algorithm in the real world. Each robot was equipped with a 360° lidar for detecting the frontier points in the unknown environment. The communication between neighboring robots was achieved via the Robot Operating System (ROS). A Raspberry Pi 3 board was installed on each Turtlebot3 for sending sensor information to, and receiving control commands from, the host computer with an Nvidia GTX 1080 GPU and an Intel Core i9 CPU (2.9 GHz). The lidar sensing range was set to 1 m in the experiment. In the experiments, we used virtual information nodes to validate the proposed exploration strategy. These virtual information nodes and associated frontier points were generated in the synchronized map on the robots' storage devices. In the future, more advanced robotic platforms can be developed to carry and place real information nodes via manipulator arms.
Fig. 11. Exploration accomplishment time of the robots using the proposed strategy and the conventional strategy in [16]. The results in each case are collected from 50 trials with different initial positions.
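A simple way to picture how the lidar yields frontier points is to treat any bearing with no return inside the sensing range as looking into unexplored space. The sketch below follows that idea with the 1 m sensing range from the experiment; the one-reading-per-degree layout and the detection rule are simplifying assumptions, not the paper's detector.

```python
import math

def frontier_candidates(robot_pos, heading, scan, r_s=1.0):
    """Mark frontier candidates from a 360-degree lidar scan: any bearing
    whose range reading is at or beyond the sensing range r_s points at
    unexplored space, so the point at distance r_s along that bearing is
    a candidate frontier. Simplified sketch only."""
    x, y = robot_pos
    candidates = []
    for i, r in enumerate(scan):          # assume one range reading per degree
        if r >= r_s:                      # no obstacle within sensing range
            a = heading + math.radians(i)
            candidates.append((x + r_s * math.cos(a), y + r_s * math.sin(a)))
    return candidates

# Hypothetical scan: walls on most bearings, two open bearings near 360 deg.
scan = [0.4] * 358 + [float('inf'), float('inf')]
pts = frontier_candidates((0.0, 0.0), 0.0, scan, r_s=1.0)
```

In practice the candidates would be projected into the shared map and clustered before being handed to the frontier-selection step.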
In this case study, the objective of the mobile robot team was to explore an unknown room. The initial position of each robot is shown in Fig. 13. All the robots were placed in the middle of the room. Each robot deployed an information node at its initial position and then started moving in a different direction due to the Voronoi-based algorithm. In order to test the robustness of the multi-robot system against dynamic environments, an obstacle was suddenly introduced during the mission. In Fig. 14, a cuboid obstacle was placed in front of the second mobile robot while it was approaching the next frontier point. Owing to the deep reinforcement learning training, the robot quickly adjusted its path to avoid the unexpected disturbance. The final trajectories of the three robots are illustrated in Fig. 15. No frontier point remained in the environment and the cooperative exploration task was accomplished. Moreover, it can also be seen that the Voronoi-based strategy reduces the control effort in terms of both the mission completion time and the total travel distance. A video of the experimental results is provided in the Supplementary Material.

V. CONCLUSION
In this paper, a novel cooperative exploration strategy and a deep reinforcement learning based mapless collision avoidance algorithm are proposed for multiple mobile robots in unknown environments. To navigate and coordinate the networked robots efficiently during collaborative tasks, dynamic Voronoi partitions are generated to minimize duplicated exploration areas when using multiple robots. A utility function that accounts for both the path cost and the target distance is designed to select the desired next frontier point, so that depth-first or breadth-first behaviour can be chosen for different scenarios. To deal with the sudden appearance of obstacles in the unknown environment, a deep reinforcement learning based collision avoidance algorithm is proposed, which integrates DDPG with prioritized experience replay (PER) to enable the control policy to learn from human demonstration data and thereby improve the learning speed and performance of DDPG. The proposed algorithm performed collision-free maneuvering in 20 random missions in the new environment while requiring only 40 minutes of training time; it needs only 14% of the training steps and 25% of the training time required by DDPG and generates safer and smoother trajectories. Simulation results and hardware experiments using real robots demonstrate the effectiveness of the proposed multi-robot cooperative exploration scheme and learning-based collision avoidance algorithm. In the future, finite-time consensus algorithms [37] will be employed to coordinate heterogeneous robotic systems including both aerial and ground vehicles.