## Information Gathering Actions

A partially observable Markov decision process (POMDP) is a framework for decision making under uncertainty. In environments with incomplete and noisy information, an agent can use observations to infer a belief over the current state and act optimally with respect to that belief. Although POMDPs allow great modeling flexibility, solving them exactly is generally computationally intractable, so the solution is commonly approximated with sampling-based approaches.
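The belief inference mentioned above is often approximated with a particle filter. The following minimal sketch updates a particle-based belief after an action and observation; the `transition` and `obs_likelihood` callables are illustrative assumptions, not part of any specific library.

```python
import random

def update_belief(particles, action, observation, transition, obs_likelihood, n=None):
    """Approximate Bayesian belief update with a bootstrap particle filter.

    `transition(s, a)` samples a successor state; `obs_likelihood(o, s, a)`
    returns p(o | s, a). Both are assumed interfaces for this sketch.
    """
    n = n or len(particles)
    # Propagate each particle through the (possibly stochastic) transition model.
    proposals = [transition(s, action) for s in particles]
    # Weight each proposal by how well it explains the received observation.
    weights = [obs_likelihood(observation, s, action) for s in proposals]
    total = sum(weights)
    if total == 0:  # observation inconsistent with every particle
        return proposals
    # Resample proportionally to the weights to approximate the posterior belief.
    return random.choices(proposals, weights=weights, k=n)
```

After a few updates, the particle set concentrates on states that are consistent with the observation history, which is exactly the belief that a POMDP policy conditions on.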

Sampling-based approaches simulate outcomes by generating samples of states, actions, and observations; the value of information-gathering actions is evaluated through this same sampling process. A well-known example is the Partially Observable Monte Carlo Planning (POMCP) algorithm. However, POMCP simulates individual states, actions, and observations and therefore cannot evaluate belief-dependent rewards; it is also inefficient in continuous action and observation spaces. For this reason, we developed a Monte Carlo tree search (MCTS) based approach, the Information Particle Filter Tree (IPFT) algorithm, which guides the agent to gather the information it needs to make informed decisions.
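A typical belief-dependent reward is the (negative) entropy of the belief: it is high when the belief is concentrated and low when it is spread out. The sketch below computes it for a discretized particle set; this is an illustration of why a single sampled state is not enough (the original algorithm estimates differential entropy over continuous states, e.g. via density estimation, which this toy version does not attempt).

```python
import math
from collections import Counter

def entropy_reward(particle_set):
    """Negative Shannon entropy of a discrete particle set.

    Returns 0 for a fully concentrated belief and increasingly negative
    values for more uncertain beliefs. A reward like this depends on the
    whole particle set, so it cannot be computed from one sampled state --
    the reason IPFT simulates particle sets through the search tree.
    """
    counts = Counter(particle_set)
    n = len(particle_set)
    return sum((c / n) * math.log(c / n) for c in counts.values())
```

For example, a belief with all particles on one state yields `0.0`, while a belief split evenly over two states yields `-log 2`.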

#### Information particle filter tree algorithm

One iteration of the IPFT algorithm: particle sets, rather than single particles, are simulated through the tree, which allows evaluating belief-dependent rewards.

#### The light-dark problem

The light-dark benchmark problem defines varying observation densities for different continuous state values. These observations are used to reduce localization uncertainty, which in turn enables the agent to complete the task.
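The state-dependent observation density can be captured by a sensor whose noise grows with the distance to a "light" region. The following one-dimensional sketch is illustrative; the light position `light_x` and the noise model are assumptions, not the benchmark's exact parameters.

```python
import random

def observe(x, light_x=5.0):
    """Sample a noisy position measurement in a 1-D light-dark world.

    Observation noise grows with the distance from the light at `light_x`
    (an assumed parameter): near the light, measurements localize the agent
    well; far away, they are nearly uninformative.
    """
    sigma = 0.1 + abs(x - light_x)  # bright near light_x, dark elsewhere
    return random.gauss(x, sigma)
```

An information-seeking planner therefore detours toward the light first, localizes itself with a few precise observations, and only then heads for the goal.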

#### Comparison of actions with POMCP(OW)

IPFT chooses actions that maximize information gain; once the agent is localized, it heads directly for the goal.

Comparison of the search trees with and without active information gathering in an intersection-crossing scenario where the route of another vehicle is unclear. The algorithm executes seemingly contradictory actions to observe the other vehicle's reaction and thereby gather more information. The differences in the Q-values are due to the inclusion of information rewards.

## Continuity of Rewards and Multi-armed Bandits

Finding exact solutions of POMDPs is generally computationally intractable, but the solution can be approximated by sampling-based approaches. These approaches rely on multi-armed bandit (MAB) heuristics, which assume the outcomes of different actions to be uncorrelated. In some applications, however, such as motion planning in continuous spaces, similar actions yield similar outcomes. We therefore use MAB variants that place Lipschitz continuity assumptions on the action-outcome mapping to improve the efficiency of sampling-based planning approaches.
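The Lipschitz assumption can be folded into a standard UCB index: if rewards satisfy |mu(a) - mu(b)| <= L * |a - b|, then every arm's index also upper-bounds nearby arms, so the bounds can be tightened by taking a minimum over all arms. The sketch below is one way to do this; the function name and exploration constant `c` are illustrative, not from a specific library.

```python
import math

def lipschitz_ucb(actions, means, counts, total, L, c=2.0):
    """Select the arm with the largest Lipschitz-tightened UCB index.

    For arm i, the plain UCB index is mean + sqrt(c * ln(total) / count).
    Under the Lipschitz assumption, ucb(j) + L * |a_i - a_j| also bounds
    mu(a_i) for every j, so the tightest bound is the minimum over all arms.
    """
    def ucb(i):
        if counts[i] == 0:
            return float("inf")  # unpulled arms get an optimistic index
        return means[i] + math.sqrt(c * math.log(total) / counts[i])

    best, best_idx = None, None
    for i, a in enumerate(actions):
        # Tightest upper bound on mu(a) implied by all arms' indices.
        bound = min(ucb(j) + L * abs(a - actions[j]) for j in range(len(actions)))
        if best is None or bound > best:
            best, best_idx = bound, i
    return best_idx
```

Unlike plain UCB, an unpulled arm no longer has an infinite index: a well-explored neighbor already constrains it, which is what makes the sampling more efficient in continuous action spaces.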

Convergence analysis for varying numbers of bandit actions.