We will evaluate speech segmentation following the discussion in Castan et al. (2015), which in turn uses the evaluation metric from the NIST RT project (see the RT-09 evaluation plan). This software might be helpful: https://github.com/nryant/dscore#rttm .
As we couldn't find a public dataset of audio that mixes speech, music, and other noise, we will annotate a small number of audio files picked from various AAPB collections to create an evaluation dataset.
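
If the annotations are scored with dscore, they would need to be stored as RTTM files. The sketch below shows one possible way to serialize annotated segments into the ten space-delimited RTTM fields described in the dscore README. It is only an illustration: the `Segment` type, the helper names, the file ID, and the choice to put the audio class (`speech`, `music`, `noise`) in the speaker-name field are all assumptions, not part of the plan above.

```python
from typing import Iterable, NamedTuple


class Segment(NamedTuple):
    """One annotated region of an audio file (hypothetical structure)."""
    file_id: str     # identifier of the audio file, without extension
    onset: float     # start time in seconds from the beginning of the file
    duration: float  # length of the segment in seconds
    label: str       # assumed class label, e.g. "speech", "music", "noise"


def to_rttm_line(seg: Segment, channel: int = 1) -> str:
    """Format one segment as a single RTTM line with ten fields.

    Field order: type, file ID, channel, onset, duration, orthography,
    speaker type, speaker name, confidence, signal lookahead time.
    Fields we do not use are written as <NA>.
    """
    return (
        f"SPEAKER {seg.file_id} {channel} "
        f"{seg.onset:.3f} {seg.duration:.3f} "
        f"<NA> <NA> {seg.label} <NA> <NA>"
    )


def to_rttm(segments: Iterable[Segment]) -> str:
    """Serialize segments to RTTM text, sorted by file and onset time."""
    ordered = sorted(segments, key=lambda s: (s.file_id, s.onset))
    return "\n".join(to_rttm_line(s) for s in ordered) + "\n"


if __name__ == "__main__":
    # Made-up example values, not real annotations.
    example = [
        Segment("example-file", 12.5, 30.0, "music"),
        Segment("example-file", 0.0, 12.5, "speech"),
    ]
    print(to_rttm(example), end="")
```

Writing one such file for the reference annotations and another for the system output would give the pair of inputs that an RTTM-based scorer compares.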