spaCy's general design philosophy is that the Doc owns the data, and Spans and Tokens are just views of it. It makes sense to replicate this here, especially to handle cases where the phoneme data doesn't align cleanly with Tokens (for which we could maybe even use Alignment).
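To make the idea concrete, here is a rough, plain-Python sketch of the "container owns the data, tokens are views" pattern applied to phonemes. It does not use spaCy itself: `PhonemeDoc`, `TokenView`, and the `spans` layout are all hypothetical names made up for illustration, and the ARPAbet-style phonemes are just sample data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PhonemeDoc:
    """Hypothetical container that owns all phoneme data for a text."""

    tokens: List[str]
    phonemes: List[str]
    # One (start, end) slice into `phonemes` per token, end exclusive.
    # Slices may overlap or be empty, so a messy alignment is still representable.
    spans: List[Tuple[int, int]]

    def token(self, i: int) -> "TokenView":
        return TokenView(self, i)


@dataclass
class TokenView:
    """Hypothetical view: holds only a reference to the doc and an index."""

    doc: PhonemeDoc
    i: int

    @property
    def text(self) -> str:
        return self.doc.tokens[self.i]

    @property
    def phonemes(self) -> List[str]:
        # Nothing is copied into the view up front; it slices the doc on demand.
        start, end = self.doc.spans[self.i]
        return self.doc.phonemes[start:end]


# "don't" split into two tokens whose phonemes can't be divided cleanly,
# so both tokens point at the same slice of the doc-level data.
doc = PhonemeDoc(
    tokens=["do", "n't", "go"],
    phonemes=["D", "OW", "N", "T", "G", "OW"],
    spans=[(0, 4), (0, 4), (4, 6)],
)
shared = doc.token(0).phonemes
last = doc.token(2).phonemes
```

Because the views store no phonemes of their own, changing how tokens map to phonemes only means rewriting `spans` on the container. In spaCy, something similar could probably be wired up with custom extension attributes (`Doc.set_extension` for the data, `Token.set_extension` with a getter for the view), but that part is an assumption and is not shown here.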