This analysis examines how well clustering captures the natural phonetic variation in American English vowels and how that performance varies across gender and age. Specifically, it investigates whether vowel categories exist as naturally distinct acoustic categories or whether they form continuous distributions that can only be distinguished by learned linguistic boundaries. To test this, I compared a supervised learning model (logistic regression) with an unsupervised clustering algorithm (K-Means).
The data was filtered to remove missing values and stratified into four demographic groups (men, women, boys, girls) to control for physiological differences in vocal tract length. I used the steady-state acoustic measurements provided in the dataset as the input features for the models: F1_SS (correlating with vowel height) and F2_SS (correlating with vowel backness). Because distance-based algorithms like K-Means are sensitive to scale, these features were standardized (Z-scored) using StandardScaler so that F1 and F2 contributed equally to the distance calculations.
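The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the analysis code itself: the formant values are synthetic, only two of the four groups are shown, and fitting a separate StandardScaler within each group is an assumption, since the text does not say whether scaling was done per group or over the pooled data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the real measurements: columns are F1_SS, F2_SS (Hz).
# Means, spreads, and group sizes are made up for illustration only.
groups = {
    "men": rng.normal(loc=[500.0, 1500.0], scale=[150.0, 400.0], size=(200, 2)),
    "women": rng.normal(loc=[600.0, 1800.0], scale=[180.0, 450.0], size=(200, 2)),
}
# Inject one missing value so the filtering step has something to remove.
groups["men"][0, 1] = np.nan

scaled = {}
for name, X in groups.items():
    X = X[~np.isnan(X).any(axis=1)]  # drop tokens with a missing formant
    # Assumption: z-score within each group (mean 0, unit variance per feature).
    scaled[name] = StandardScaler().fit_transform(X)

for name, Z in scaled.items():
    print(name, Z.shape, Z.mean(axis=0).round(3), Z.std(axis=0).round(3))
```

After scaling, each feature has mean 0 and standard deviation 1 within its group, so a one-unit difference in F1 counts the same as a one-unit difference in F2 in any Euclidean distance.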
(chart to be added)
As predicted, the supervised logistic regression model vastly outperformed the unsupervised K-Means algorithm across all demographic groups. K-Means struggled to find natural clusters, yielding low silhouette scores in every group; a score near zero indicates overlapping clusters. In contrast, when the logistic regression model was given the linguistic labels, it achieved high classification accuracy. This contrast supports the hypothesis that American English vowels do not form naturally discrete acoustic islands based on F1 and F2 alone; they exist on a continuum where linguistic boundaries must be learned. The hypothesis that adults would exhibit higher classification accuracy than children was also supported. However, the 95% confidence intervals for women overlap substantially with those of the two children's groups, so the difference in separability between them in this sample may not be statistically robust.
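For readers who want to see how the two kinds of score are obtained, here is a hedged sketch on synthetic data. Several details are assumptions rather than the method used in this analysis: the number of clusters is set equal to the number of vowel categories, accuracy is estimated with 5-fold cross-validation, and the interval is a simple normal-approximation (Wald) interval. The helper `wald_interval` is a hypothetical name introduced for illustration.

```python
import math
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: three overlapping "vowel" clouds in (F1, F2) space.
centers = np.array([[300.0, 2300.0], [550.0, 1800.0], [700.0, 1300.0]])
X = np.vstack([rng.normal(c, [90.0, 250.0], size=(100, 2)) for c in centers])
y = np.repeat(np.arange(len(centers)), 100)
Z = StandardScaler().fit_transform(X)

# Unsupervised: K-Means with k equal to the number of categories (assumption),
# scored by the mean silhouette coefficient, which ranges from -1 to 1.
cluster_ids = KMeans(n_clusters=len(centers), n_init=10, random_state=0).fit_predict(Z)
sil = silhouette_score(Z, cluster_ids)

# Supervised: logistic regression, scored on held-out folds (assumption: 5-fold CV).
pred = cross_val_predict(LogisticRegression(max_iter=1000), Z, y, cv=5)
acc = float((pred == y).mean())

def wald_interval(p, n, z=1.96):
    """Normal-approximation 95% interval for a proportion, clipped to [0, 1]."""
    half = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

lo, hi = wald_interval(acc, len(y))
print(f"silhouette={sil:.3f}  accuracy={acc:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

Note that the silhouette coefficient and classification accuracy are measured on different scales, so the comparison between them is qualitative: one describes how compact and separated the clusters are, the other how often a labeled boundary assigns the right vowel.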
The confusion matrices visually confirm the nature of the phonetic overlap. Across all demographics, the "corner vowels" (e.g., /iy/ as in "beet", /uw/ as in "boot", and /aa/ as in "cot") were classified with high accuracy, often approaching 100%.
(all four confidence charts to be added)
In every group, the model struggles to separate the "eh" in "bet" (/eh/) from the "a" in "bat" (/ae/). It also frequently mixes up the vowels in "book" (/uh/), "but" (/ah/), and "bird" (/er/). This makes sense, as these sounds are produced close together in the middle of the mouth and sound very similar. The men's group has the cleanest, darkest diagonal with the fewest errors, meaning their vowels were the easiest for the model to separate; the adult women's group is also highly accurate. The matrices for the children, however, are slightly less accurate. This aligns with the phonetic theory that adults have fully developed vocal tracts and stable, consistent motor control over their speech. Since children are still physically developing, their pronunciation varies more from word to word, causing their vowel categories to overlap more than adults' do.
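The reading of the confusion matrices above can be made concrete with a small sketch. The labels below are a hypothetical handful of tokens, not results from this dataset; the point is only to show how per-vowel accuracy (the diagonal) and the most frequent confusion (the largest off-diagonal cell) are read from the matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for ten tokens; not real data.
labels = ["iy", "eh", "ae", "uw"]
y_true = ["iy", "iy", "eh", "eh", "eh", "ae", "ae", "ae", "uw", "uw"]
y_pred = ["iy", "iy", "eh", "ae", "eh", "ae", "eh", "ae", "uw", "uw"]

# Rows are true vowels, columns are predicted vowels, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Row-normalize so each row sums to 1; the diagonal is then per-vowel accuracy.
row_norm = cm / cm.sum(axis=1, keepdims=True)
per_vowel = dict(zip(labels, np.diag(row_norm)))

# Zero the diagonal and find the largest remaining cell: the top confusion.
off = cm.copy()
np.fill_diagonal(off, 0)
i, j = np.unravel_index(off.argmax(), off.shape)

print(per_vowel)
print(f"most frequent confusion: /{labels[i]}/ classified as /{labels[j]}/")
```

In this toy example the corner vowels /iy/ and /uw/ sit entirely on the diagonal, while /eh/ and /ae/ leak into each other, which is the same pattern described for the real matrices.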