Information Gathering Actions

A partially observable Markov decision process (POMDP) is a framework for decision-making under uncertainty. In environments with incomplete and noisy information, the agent maintains a belief over the current state, inferred from observations, and uses it to choose optimal actions. Although POMDPs allow great modeling flexibility, solving them exactly is generally computationally intractable. It is therefore common to approximate the solution with sampling-based approaches.
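To make the belief-update idea concrete, the sketch below shows a minimal bootstrap particle filter that turns an observation into an updated belief over states. The `transition` and `observation_likelihood` callables are assumed, user-supplied models of the POMDP, not part of any particular library.

```python
import numpy as np

# Minimal bootstrap particle filter sketch. `transition(s, a, rng)` and
# `observation_likelihood(o, s)` are assumed, user-supplied POMDP models.

def belief_update(particles, action, observation,
                  transition, observation_likelihood, rng):
    """Approximate the posterior belief after taking `action` and seeing `observation`."""
    # Predict: propagate every particle through the (stochastic) transition model.
    predicted = np.array([transition(s, action, rng) for s in particles])
    # Weight: how well does each predicted state explain the observation?
    weights = np.array([observation_likelihood(observation, s) for s in predicted])
    weights = weights / weights.sum()
    # Resample: draw a new particle set proportionally to the weights.
    idx = rng.choice(len(predicted), size=len(predicted), p=weights)
    return predicted[idx]
```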

Sampling-based approaches simulate outcomes by generating samples of states, actions, and observations; the value of information-gathering actions is evaluated through the same sampling process. A well-known sampling-based approach is the Partially Observable Monte Carlo Planning (POMCP) algorithm. Because POMCP simulates individual states, actions, and observations, it cannot account for belief-dependent rewards, and it is inefficient in continuous action and observation spaces. For this reason, we developed a Monte Carlo tree search (MCTS) based approach, the Information Particle Filter Tree (IPFT) algorithm, which guides an agent to gather the information it needs to make informed decisions.
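The following sketch illustrates how a belief-dependent information reward can be computed from particle sets, roughly in the spirit of IPFT: the ordinary state reward is combined with the reduction in an entropy estimate of the belief. The kernel density estimator and the weighting factor `lam` are illustrative choices for this sketch, not necessarily those of the published algorithm.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Illustrative sketch of a belief-dependent reward computed from particle sets.
# `particles_*` are (n, d) arrays of state samples; the KDE-based entropy
# estimator and the weight `lam` are assumptions made for this sketch.

def entropy_estimate(particles):
    """Monte Carlo estimate of the differential entropy of a particle belief."""
    kde = gaussian_kde(particles.T)      # kernel density estimate of the belief
    return -np.mean(kde.logpdf(particles.T))

def information_reward(particles_before, particles_after, state_reward, lam=1.0):
    """State reward plus a bonus for reducing the belief's entropy."""
    info_gain = entropy_estimate(particles_before) - entropy_estimate(particles_after)
    return state_reward + lam * info_gain
```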

Information Particle Filter Tree algorithm

One iteration of the IPFT algorithm: particle sets, rather than single particles, are simulated through the tree, which allows evaluating belief-dependent rewards.

The light-dark problem

The light-dark benchmark problem defines observation densities that vary with the continuous state: observations are informative in the "light" region and noisy in the "dark" region. These observations reduce localization uncertainty, which in turn enables the agent to complete its task.
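A toy version of the light-dark observation model might look as follows; the light position and noise constants are assumptions chosen for illustration, not the benchmark's actual parameters.

```python
import numpy as np

# Toy 1-D light-dark observation model: measurement noise grows with the
# distance to the "light" region. LIGHT_X and the noise constants are assumed.

LIGHT_X = 5.0

def observe(x, rng):
    """Sample a noisy position measurement; the noise shrinks near the light."""
    std = 0.5 + abs(x - LIGHT_X)     # state-dependent observation noise
    return rng.normal(loc=x, scale=std)

rng = np.random.default_rng(0)
print(observe(0.0, rng))       # far from the light: very noisy
print(observe(LIGHT_X, rng))   # near the light: accurate
```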

MCTS in the continuous light-dark problem.

Comparison of actions with POMCP(OW).

Actions chosen by MCTS in the continuous light-dark problem.

IPFT chooses actions that maximize information gain; once the agent has localized itself, it heads directly for the goal.



Comparison of the search trees with and without active information gathering in an intersection-crossing scenario where the route of another vehicle is unclear. The algorithm executes contradictory actions to observe the other vehicle's reaction and thereby gather more information. The differences in the Q-values are due to the inclusion of information rewards.

Publications on this topic
Ömer Şahin Taş. Motion Planning for Autonomous Vehicles in Partially Observable Environments. Ph.D. thesis, Karlsruhe Institute of Technology, 2022.
Johannes Fischer, Ömer Şahin Taş. Information Particle Filter Tree: An Online Algorithm for POMDPs with Belief-Based Rewards on Continuous Domains. In International Conference on Machine Learning (ICML), 2020.

Continuity of Rewards and Multi-armed Bandits

Finding exact solutions of POMDPs is generally computationally intractable, but the solution can be approximated by sampling-based approaches. These approaches rely on multi-armed bandit (MAB) heuristics, which assume that the outcomes of different actions are uncorrelated. In some applications, however, such as motion planning in continuous spaces, similar actions yield similar outcomes. We therefore use MAB variants that place Lipschitz continuity assumptions on the outcomes of actions to improve the efficiency of sampling-based planning.
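As a rough illustration, the sketch below tightens standard UCB indices over a discretized one-dimensional action space using a Lipschitz bound; the Lipschitz constant `L` and the exploration weight `c` are assumed values, and the exact formulation in our work may differ.

```python
import numpy as np

# Sketch of a UCB rule that exploits Lipschitz continuity of action outcomes
# over a 1-D action space. L and c are illustrative constants.

def lipschitz_ucb(actions, means, counts, total, L=1.0, c=2.0):
    """Return the index of the action to sample next.

    Each action's plain UCB index is tightened by the best bound implied by any
    other action via the Lipschitz assumption |q(a) - q(a')| <= L * |a - a'|.
    """
    counts = np.maximum(counts, 1)                      # avoid division by zero
    conf = np.sqrt(c * np.log(total + 1) / counts)      # exploration bonus
    ucb = means + conf                                   # standard UCB indices
    dist = np.abs(actions[:, None] - actions[None, :])  # pairwise action distances
    # An action's bound cannot exceed any other action's bound plus L times their distance.
    tightened = np.min(ucb[None, :] + L * dist, axis=1)
    return int(np.argmax(tightened))
```

Because information propagates between neighbouring actions through the Lipschitz bound, fewer samples per action are needed than with uncorrelated UCB indices.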

Multi-hypothesis planning:
Q-value profile of a scenario without considering uncertainty.
Q-value profile of the same scenario with uncertainty.
Gaussian process regression applied to a Q-value profile obtained by sampling 100 particles.
Lipschitz multi-armed bandit

Convergence analysis for varying numbers of bandit actions.

Publications on this topic
Ömer Şahin Taş, Felix Hauser, Martin Lauer. Efficient Sampling in POMDPs with Lipschitz Bandits for Motion Planning in Continuous Spaces. In IEEE Intelligent Vehicles Symposium (IV), 2021.