Face perception unfolds dynamically over time and primarily in three-dimensional space. Perceived similarity, including identity, should ideally remain invariant to changes along these dimensions. Surprisingly, much of our knowledge about face representations stems from static presentations of 2D images, which might not sufficiently capture real-world dynamic face processing. To test the effect of space and time on face similarity judgements, we conducted a pre-registered (https://osf.io/678uh) experiment using a triplet odd-one-out task in a static 2D and a dynamic 3D viewing condition. We then trained sparse and deep computational encoding models of human face similarity judgements to investigate the latent representations underlying their predictions. Aggregated over all faces, we found a strong correlation between viewing conditions, indicating consistent processing of face similarity in 2D and 3D. Despite this overall similarity, our encoding models revealed subtle differences between viewing conditions: a small set of face features, such as the distance between chin and cheeks, eye size, nose shape, and, particularly in the 3D condition, the facial width-to-height ratio, explained much of the variance in human judgements. Our openly available data and encoding models lay the groundwork for understanding face similarity judgements, which are crucial for our ability to recognise and identify faces in a dynamically changing environment.
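To make the modelling approach concrete, the sketch below illustrates, under purely hypothetical assumptions, how a sparse encoding model could relate interpretable face features to pairwise similarity and then derive triplet odd-one-out choices. The feature names, simulated data, and Lasso regression are illustrative assumptions and not the models or data reported in the study.

```python
# Illustrative sketch only: a sparse linear encoding model that predicts
# pairwise face similarity from a few interpretable face features, then
# derives triplet odd-one-out choices from the predicted similarities.
# All feature names, weights, and data below are hypothetical.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

n_faces, n_features = 50, 4
feature_names = ["chin_cheek_distance", "eye_size", "nose_shape", "width_height_ratio"]

# Hypothetical interpretable feature values for each face
X = rng.normal(size=(n_faces, n_features))

# Simulated "behavioural" similarity: weighted negative feature distance plus noise
true_w = np.array([0.8, 0.5, 0.4, 0.9])

def true_similarity(i, j):
    return -np.abs(X[i] - X[j]) @ true_w

pairs = [(i, j) for i in range(n_faces) for j in range(i + 1, n_faces)]
X_pairs = np.array([np.abs(X[i] - X[j]) for i, j in pairs])
y_pairs = np.array([true_similarity(i, j) for i, j in pairs])
y_pairs += rng.normal(scale=0.1, size=len(pairs))

# Sparse encoding model: the L1 penalty keeps only the features that matter
model = Lasso(alpha=0.05).fit(X_pairs, y_pairs)
print(dict(zip(feature_names, model.coef_.round(2))))

def predicted_similarity(i, j):
    return model.predict(np.abs(X[i] - X[j])[None])[0]

def odd_one_out(i, j, k):
    # The odd one out is the face excluded from the most similar pair
    candidates = {(i, j): k, (i, k): j, (j, k): i}
    best_pair = max(candidates, key=lambda p: predicted_similarity(*p))
    return candidates[best_pair]

print(odd_one_out(0, 1, 2))
```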