Did you consider evaluating on more datasets? And what about regression problems? There is, for example, this benchmarking suite, which is accessible via the OpenML packages: [https://arxiv.org/abs/1708.03731](https://arxiv.org/abs/1708.03731)