Learning Multi-agent Multi-machine Tending by Mobile Robots

Introduction

Robotics can help address the growing worker shortage in the manufacturing industry. Machine tending, in particular, is a task that collaborative robots can take over while substantially boosting productivity. Nevertheless, existing robotic systems deployed in this sector rely on fixed, single-arm setups, whereas mobile robots can provide more flexibility and scalability.

In this work, we introduce a multi-agent multi-machine tending learning framework for mobile robots based on Multi-agent Reinforcement Learning (MARL) techniques, with a suitable observation and reward design. Moreover, an attention-based encoding mechanism is developed and integrated into the Multi-agent Proximal Policy Optimization (MAPPO) algorithm to boost its performance in machine tending scenarios.

Our model (AB-MAPPO) outperformed MAPPO in this new challenging scenario in terms of task success, safety, and resource utilization. Furthermore, we provide an extensive ablation study to support our various design decisions.

Here we provide complementary materials for our paper.

Topics covered

  • Simulation Environment
  • The parameters of our model (AB-MAPPO)
  • More Experiments and Results
  • Experiment on Real Robots
  • Possible Future Directions

Simulation Environment

In this section, we provide more insights about the VMAS simulation environment.

Observation

Many simulation environments simplify the world representation as an occupancy grid and represent the position of each object by the index of the cell it occupies. VMAS, however, uses a more expressive, continuous observation space: the origin (0, 0) is at the center of the arena, and objects' positions and velocities are described as continuous offsets and rates of change along the x and y directions.
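For illustration, a per-agent observation in this style can be assembled as in the sketch below. The specific fields, and the `self.machines` and `self.storage` scenario attributes, are assumptions made for the example, not the exact observation design used in the paper.

```python
import torch
from vmas.simulator.core import Agent
from vmas.simulator.scenario import BaseScenario


class MachineTendingScenario(BaseScenario):
    # make_world, reset_world_at, reward, ... omitted.

    def observation(self, agent: Agent):
        # All quantities are continuous tensors of shape (batch_dim, 2),
        # expressed in arena coordinates with the origin at the center.
        # `self.machines` and `self.storage` are hypothetical scenario
        # attributes holding the machine and storage-area entities.
        return torch.cat(
            [
                agent.state.pos,  # agent (x, y) offset from the arena center
                agent.state.vel,  # agent velocity (vx, vy)
                *[m.state.pos - agent.state.pos for m in self.machines],
                self.storage.state.pos - agent.state.pos,
            ],
            dim=-1,
        )
```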

Action

In VMAS, agents' actions are represented as physical forces fx and fy, with an optional communication action (which we did not use in this work). At each time step, the physics engine uses the forces at time t to compute the next world state, taking into account effects such as velocity damping, gravity, and collision forces with other objects [1]. VMAS supports both continuous and discrete actions, the latter being converted internally to continuous actions. We used discrete actions with 5 possible values: 0 translates to (fx=0, fy=0), i.e., doing nothing; 1 translates to (-1, 0), accelerating the agent to the left; 2 translates to (1, 0), accelerating it to the right; and similarly 3 for down and 4 for up.
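The discrete-to-force mapping described above can be summarized as in the following sketch (VMAS performs an equivalent conversion internally; this is only an illustration):

```python
# Sketch of the 5-value discrete action -> (fx, fy) mapping described above.
ACTION_TO_FORCE = {
    0: (0.0, 0.0),   # do nothing
    1: (-1.0, 0.0),  # accelerate left
    2: (1.0, 0.0),   # accelerate right
    3: (0.0, -1.0),  # accelerate down
    4: (0.0, 1.0),   # accelerate up
}

def to_force(action: int) -> tuple[float, float]:
    return ACTION_TO_FORCE[action]
```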

Reward

VMAS lets you design the reward function for each agent at each step based on the environment state, which allowed us to design the complex dense reward presented in the paper.
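Concretely, this is done by implementing the per-agent reward hook of a VMAS scenario. The skeleton below only shows that structure; the commented-out shaping terms are hypothetical placeholders, and the actual dense reward terms are defined in the paper.

```python
import torch
from vmas.simulator.scenario import BaseScenario


class MachineTendingScenario(BaseScenario):
    # make_world, reset_world_at, observation, ... omitted.

    def reward(self, agent):
        # VMAS calls this once per agent at every step, so any function of
        # the current world state can be returned as a dense reward.
        rew = torch.zeros(self.world.batch_dim, device=self.world.device)
        # Placeholder shaping terms (the actual terms are in the paper):
        # rew += delivery_bonus(agent)
        # rew -= collision_penalty(agent)
        return rew
```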

Various model parameters

This section lists the parameters of our best-performing model.
Parameter                         Value
Reward Sharing                    False
Fixed Uncollected Parts Penalty   True
Encoding Vector Size              18
Object Embedding Size             18
Number of Attention Heads         3
Attention Embedding Size          18
Learning Rate                     7e-4
Optimizer                         Adam

More Experimental Results

More Ablation Studies

Here we compare the performance of different design options for our attention mechanism. Exp1 is AB-MAPPO without the attention-vector preparation, object embedding, or observation concatenation steps. Exp2 is AB-MAPPO with observation concatenation but without the attention-vector preparation or object embedding steps. Exp3 is AB-MAPPO without only the attention-vector preparation step, which means the model performs self-attention because Q, K, and V are the same vector. Exp4 uses 1 attention head instead of 3, a smaller attention embedding size (16 instead of 18), and no observation encoding step.

Experiment Collected Delivered Collisions Avg(MU) Avg(AU)
AB-MAPPO 11.86 (0.31) 10.49 (0.43) 8.99 (1.87) 0.59 (0.16) 0.59 (0.04)
Exp1 11.39 (0.25) 10.09 (0.08) 12.02 (2.81) 0.57 (0.06) 0.57 (0.06)
Exp2 11.49 (0.48) 10.15 (0.32) 9.22 (1.41) 0.57 (0.06) 0.58 (0.04)
Exp3 10.53 (1.76) 9.3 (1.56) 14.61 (10.22) 0.52 (0.02) 0.53 (0.08)
Exp4 9.91 (1.2) 8.75 (1.08) 17.64 (4.69) 0.5 (0.16) 0.49 (0.06)

More Qualitative Results

This section presents videos of AB-MAPPO's performance in various environment setups. The videos are from episodes collected randomly towards the end of the model's training.

Environment setups where the model performed well:

Environments with less optimal performance:

Experiment on Real Robots

Small tabletop robots were selected to showcase this work on real robots. Our tabletop robots are custom-made, adapted from Stanford's original Zooids [2]. For onboard control and communication, they rely on a 48 MHz ARM microcontroller (STM32F051C8) with a 2.4 GHz radio chip (nRF24L01+).
They are powered by a small 100 mAh LiPo battery and use a differential drive actuated by two small DC motors. For localization, two photodiodes positioned on top detect and decode Gray-coded patterns projected from above by a 3000 Hz projector (DLP LightCrafter 4500).
Zooids Structure

For the experimental setup, we designed a real-world arena (shown in the video below) that mimics the one in the simulation, with the whitish box in the middle representing the storage area. The top box on each side (left and right) represents a machine that produces parts, and the two black boxes below each machine represent obstacles/machine blockers. It is worth mentioning that this experiment uses the model before adding the attention mechanism; we expect even better performance with the attention-based version.

This setup is used to showcase the model on real robots by mirroring the simulation on them: the simulation runs continuously on a ground station, with one simulated agent per robot in the arena, and the position of each simulated agent is sent as a position command that the corresponding robot drives to.
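A minimal sketch of this mirroring loop is shown below; `policy`, `send_goto`, and the scaling factor are hypothetical stand-ins for the trained model, the Zooids radio interface, and the simulation-to-table coordinate scaling, respectively.

```python
import time

def mirror_simulation(env, policy, send_goto, sim_to_table_scale=0.2, rate_hz=10):
    """Mirror a running VMAS environment onto the tabletop robots.

    `policy` is the trained model, `send_goto(robot_id, x, y)` is a
    hypothetical wrapper around the Zooids 2.4 GHz radio link, and
    `sim_to_table_scale` maps arena coordinates to table coordinates.
    """
    obs = env.reset()
    while True:
        actions = policy(obs)
        obs, rews, dones, info = env.step(actions)
        for robot_id, agent in enumerate(env.agents):
            # Use the first (and only) vectorized environment instance.
            x, y = agent.state.pos[0].tolist()
            send_goto(robot_id, x * sim_to_table_scale, y * sim_to_table_scale)
        time.sleep(1.0 / rate_hz)
```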

Despite the differences in dynamics (omnidirectional agents in simulation versus differential-drive robots, and smaller machines and storage area in the real arena), the robots were able to reproduce the same behavior as in the simulation: going to the machines to pick up a ready part and then going to the storage area to deliver it.

Promising Future Directions

Here we cover some directions that we started to investigate and that did not yield good performance yet, but that we think are worth investigating further in future work.

Attention for actor

We studied integrating our attention-based mechanism into the Actor too. We followed a structure similar to our critic, but because the Actor is decentralized by design and only has access to the agent's own observation, we changed the design of our attention-based encoding mechanism to learn the relationships between the different objects in that observation (the agent itself, other agents, machines, and the storage area). The design is shown below.

Actor Design
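A rough PyTorch sketch of such a per-agent encoder is given below; the layer names and default sizes are illustrative assumptions, not the exact architecture.

```python
import torch
import torch.nn as nn


class ActorAttentionEncoder(nn.Module):
    """Self-attention over the objects visible in one agent's observation
    (the agent itself, other agents, machines, and the storage area)."""

    def __init__(self, obj_dims, embed_size=15, n_heads=3):
        super().__init__()
        # One small embedding layer per object in the observation.
        self.obj_embeds = nn.ModuleList(nn.Linear(d, embed_size) for d in obj_dims)
        self.attn = nn.MultiheadAttention(embed_size, n_heads, batch_first=True)

    def forward(self, obj_features):
        # obj_features: list of tensors, one per object, each (batch, obj_dim).
        tokens = torch.stack(
            [emb(x) for emb, x in zip(self.obj_embeds, obj_features)], dim=1
        )  # (batch, n_objects, embed_size)
        attended, _ = self.attn(tokens, tokens, tokens)  # self-attention over objects
        return attended.flatten(start_dim=1)  # flattened encoding fed to the actor MLP
```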

We then conducted experiments combining AB-MAPPO with different variations of our attention-based encoding for the Actor network. Exp1 uses AB-MAPPO plus the architecture above for the Actor network, with 3 attention heads and an object encoding and attention embedding size of 15. Exp2 is similar to Exp1 but with an object encoding and attention embedding size of 18. Exp3 is similar to Exp1 but with object encoding size 12 and attention embedding size 9. Exp4 is similar to Exp1 but with object encoding and attention embedding size 9. Exp5 uses object encoding and attention embedding size 6. Exp6 is similar to Exp1 but with a single attention head, object encoding size 6, and attention embedding size 6. Exp7 uses a single attention head, object encoding size 18, and attention embedding size 6. Exp8 uses a single attention head, object encoding size 9, and attention embedding size 6.

Experiment Collected Delivered Collisions Avg(MU) Avg(AU)
AB-MAPPO 11.86 (0.31) 10.49 (0.43) 8.99 (1.87) 0.59 (0.16) 0.59 (0.04)
Exp1 9.75 (1.44) 8.61 (1.27) 15.11 (6.48) 0.49 (0.05) 0.49 (0.02)
Exp2 9.44 (0.64) 8.35 (0.58) 18.9 (3.58) 0.47 (0.02) 0.47 (0.04)
Exp3 7.97 (0.0) 7.0 (0.0) 20.48 (0.0) 0.4 (0.38) 0.4 (0.12)
Exp4 11.88 (0.31) 10.37 (0.1) 9.35 (2.25) 0.59 (0.07) 0.59 (0.06)
Exp5 9.79 (0.58) 8.43 (0.38) 20.74 (6.14) 0.49 (0.2) 0.49 (0.04)
Exp6 10.28 (1.98) 9.11 (1.74) 10.77 (3.69) 0.51 (0.26) 0.51 (0.03)
Exp7 10.26 (1.94) 8.98 (1.65) 13.63 (4.88) 0.51 (0.0) 0.51 (0.04)
Exp8 11.03 (1.64) 9.64 (1.43) 12.97 (5.47) 0.55 (0.1) 0.55 (0.07)

Curriculum Learning

Due to the complexity of the problem, we considered using curriculum learning to let the model focus on one aspect of the problem at a time; for example, emphasizing safety and collision reduction without a large loss in performance on the main task (part collection and delivery). In this setup, the model is first trained with our reward setup until it reaches an acceptable performance; then the collision penalty is gradually increased so the agents focus more on collision avoidance. This track was investigated in parallel with the model design improvements, so we used MAPPO without our attention-based encoding as the baseline for these experiments. Exp1 is the baseline trained for 22200 episodes. Exp2 is the same model trained with curriculum learning: for the first 10000 episodes it used the same reward as the baseline, then every 2000 episodes the collision penalty was increased in magnitude by 2, reaching a value of -22 at episode 20000, with no further increase until the end of the 22200 episodes. Exp3 is the model trained from the beginning with the increased collision penalty.
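The collision-penalty schedule of Exp2 can be written as a simple step function of the episode index, as sketched below; the baseline penalty value and the exact timing of the first increment are assumptions inferred from the description above.

```python
def collision_penalty(episode: int, base_penalty: float = -12.0) -> float:
    """Collision-penalty schedule matching the curriculum described above.

    The baseline value (-12) and the first-increment timing are assumptions,
    chosen to be consistent with reaching -22 at episode 20000.
    """
    if episode < 10_000:
        return base_penalty  # same reward as the baseline
    # One increment of magnitude 2 every 2000 episodes, capped at episode 20000.
    increments = min((episode - 10_000) // 2_000, 5)
    return base_penalty - 2.0 * increments
```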

Experiment Collected Delivered Collisions Avg(MU) Avg(AU)
Exp1 10.94 (2.09) 9.51 (1.87) 13.91 (9.09) 0.55 (0.01) 0.55 (0.05)
Exp2 7.58 (0.6) 6.66 (0.51) 19.33 (2.21) 0.38 (0.37) 0.38 (0.1)
Exp3 8.61 (2.01) 7.47 (1.71) 11.28 (4.72) 0.43 (0.18) 0.43 (0.06)

As expected, training from the beginning with a higher collision penalty resulted in a drop in part collection and delivery but fewer collisions (Exp3). Unexpectedly, on average, curriculum learning (Exp2) did not reduce the drop in part collection and delivery; however, some seeds gave promising results, so we think this track is still worth further investigation.

References

  1. Bettini, M., Kortvelesy, R., Blumenkamp, J., & Prorok, A. (2022, November). VMAS: A vectorized multi-agent simulator for collective robot learning. In International Symposium on Distributed Autonomous Robotic Systems (pp. 42-56). Cham: Springer Nature Switzerland.
  2. Le Goc, M., Kim, L. H., Parsaei, A., Fekete, J. D., Dragicevic, P., & Follmer, S. (2016, October). Zooids: Building blocks for swarm user interfaces. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology (pp. 97-109).