There is an ongoing debate about the role of artificial networks in understanding the visual brain. Internal representations of images in artificial networks have been found to develop human-like properties. This study analyzes three factors involved in the emergence of human-like behavior: function, architecture, and environment. To this end, the correlation between human perception and artificial networks is evaluated at different depths of 46 pre-trained model configurations, none of which includes any psycho-visual information.
The results indicate that most models correlate more strongly with human opinion than SSIM, a de facto standard in subjective image quality assessment. Moreover, some models outperform state-of-the-art networks specifically tuned for this application (LPIPS, DISTS).
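As an illustration of the kind of comparison involved, the sketch below correlates metric outputs with human mean opinion scores. It assumes the scikit-image, lpips, scipy, and torch packages; the images and opinion scores are random placeholders standing in for a real subjective-quality database such as TID2013.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity as ssim
from scipy.stats import spearmanr

def to_tensor(img):
    # HWC uint8 in [0, 255] -> NCHW float in [-1, 1], the range lpips expects
    t = torch.from_numpy(img).permute(2, 0, 1).float() / 127.5 - 1.0
    return t.unsqueeze(0)

loss_fn = lpips.LPIPS(net='alex')  # perceptual metric tuned on human judgments

# Hypothetical data: 10 reference/distorted pairs and their human scores.
refs = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(10)]
dsts = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(10)]
mos = np.random.rand(10)  # mean opinion scores from human observers

ssim_vals = [ssim(r, d, channel_axis=-1) for r, d in zip(refs, dsts)]
lpips_vals = [loss_fn(to_tensor(r), to_tensor(d)).item() for r, d in zip(refs, dsts)]

# Rank correlation of each metric with human opinion (higher |rho| = better).
rho_ssim, _ = spearmanr(ssim_vals, mos)
rho_lpips, _ = spearmanr(lpips_vals, mos)
print(f'SSIM vs MOS:  rho = {rho_ssim:.3f}')
print(f'LPIPS vs MOS: rho = {rho_lpips:.3f}')
```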
With regard to function, supervised classification produces networks that correlate more closely with human perception than the explored self-supervised and unsupervised models. However, superior task performance does not necessarily imply more human-like behavior.
In terms of architecture, simpler models correlate more strongly with human perception than deeper networks. Additionally, the highest correlation is generally not observed in the final layer.
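A minimal sketch of the layer-wise analysis behind this observation, assuming torchvision's pre-trained VGG16 as one possible backbone: it records the distance between the internal representations of a reference and a distorted image after each layer, so that the per-layer distances can be correlated with human scores as in the sketch above.

```python
import torch
from torchvision.models import vgg16, VGG16_Weights

model = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).eval()

def layer_distances(ref, dst, n_layers=16):
    """Euclidean distance between the two representations after each layer."""
    dists = []
    x, y = ref, dst
    with torch.no_grad():
        for layer in list(model.features)[:n_layers]:
            x, y = layer(x), layer(y)  # propagate both images one layer deeper
            dists.append(torch.norm(x - y).item())
    return dists

# Hypothetical inputs: a random "reference" and a slightly distorted copy
# (ImageNet normalization omitted for brevity).
ref = torch.rand(1, 3, 224, 224)
dst = (ref + 0.1 * torch.randn_like(ref)).clamp(0, 1)
print(layer_distances(ref, dst))
```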
Finally, concerning the environment, training on large natural-image datasets results in stronger correlations with human perception than training on smaller, content-restricted databases, as expected. The findings also indicate that the most effective classification models are not necessarily the best at predicting human perceptual distances.
