INTRODUCTION: One of the goals of remote sensing (RS) image understanding is to provide a comprehensive, human-like interpretation of the data that is accessible to users who lack expertise in RS (Tuia et al., in press). In practice, this calls for systems that can understand natural human expressions and reasoning, making the information extraction process more intuitive and interactive. Despite this overarching goal, several recent approaches remain highly specialized: they are tailored to a single task and amount to some form of optimized classification or semantic segmentation of the images. As a result, a clear gap remains between these techniques and end users. [...]