Change in protein stability, quantified as the change in Gibbs free energy of folding (ΔΔG, in kcal/mol), plays a crucial role in functional alterations of proteins, with misfolding and destabilization commonly associated with pathogenicity. The past two decades have brought the development of bioinformatics tools that leverage evolutionary knowledge, empirical force fields, and machine learning to predict stability changes. However, existing tools are often optimized towards or trained on limited and unbalanced experimental data, which can lead to overfitting. Here, we benchmark selected stability predictors on an unbiased and balanced dataset, using AlphaFold structures as input. Predictor performance declines when the data are balanced across amino acids, stabilizing and destabilizing mutations, and protein representatives, highlighting that addressing redundancy alone is insufficient to correct a benchmark. Additionally, we show that an ensemble of protein structures from molecular dynamics is a better input than a single static structure, whereas coarse-grained methodologies tend to decrease output quality.