  year: 2025
  url: "https://arxiv.org/abs/2506.02098"
  bibtex: "@article{ozdogan2025libribrain,\n title={LibriBrain: Over 50 Hours of Within-Subject MEG to Improve Speech Decoding Methods at Scale},\n author={Özdogan, Miran and Landau, Gilad and Elvers, Gereon and Jayalath, Dulhan and Somaiya, Pranav and Mantegna, Francesco and Woolrich, Mark and Parker Jones, Oiwi},\n journal={arXiv preprint arXiv:2506.02098},\n year={2025}\n}"
- id: "mantegna2025braininsp"
  title: "Brain-Inspired Approaches to Speech Detection"
  bibtex: "@misc{mantegna2025brainInspired,\n title={Brain-Inspired Approaches to Speech Detection},\n author={Mantegna, Francesco and Elvers, Gereon and Parker Jones, Oiwi},\n year={2025},\n url={https://neural-processing-lab.github.io/2025-libribrain-competition/blog/brain-inspired-approaches-speech-detection},\n note={Blog post}\n}"
- id: "landau2025speechref"
  title: "The Speech Detection task and the reference model"
  bibtex: "@misc{landau2025speechref,\n title={The Speech Detection task and the reference model},\n author={Landau, Gilad and Elvers, Gereon and Özdogan, Miran and Parker Jones, Oiwi},\n year={2025},\n url={https://neural-processing-lab.github.io/2025-libribrain-competition/blog/speech-detection-reference-model},\n note={Blog post}\n}"
---
### **Introduction**
In the 2025 PNPL Competition ([Landau et al. 2025](https://arxiv.org/abs/2506.10165)), phoneme classification is presented as a categorical problem—given neural signals, predict which of the 39 ARPABET phonemes was heard. In [a previous blog](https://neural-processing-lab.github.io/2025-libribrain-competition/blog/brain-inspired-approaches-speech-detection/), we suggested some neuroscience-inspired ideas for the speech detection task. Here, we suggest linguistics-inspired ideas for phoneme classification.

### **The ARPABET Phoneme Set**

Before exploring the idea of classifying phonetic features, let's establish the complete ARPABET inventory we're working with:
Phonetic features offer several compelling advantages over direct phoneme classification, particularly for MEG data where training examples may be limited:

#### **1. Data Efficiency Through Shared Structure**

Consider these phoneme pairs and their shared features:
Instead of learning 39 independent phoneme categories, the model can learn combinations of ~30 features. This shared structure enables transfer learning: knowledge about [voicing] learned from /p/ vs /b/ pairs transfers to help distinguish /s/ vs /z/, /f/ vs /v/, and other voicing contrasts. In addition, the model can learn a more abstract representation of speech, structured around how phonemes are articulated in the mouth, which could also benefit the classification of other phonemes, including those outside English, without further training data.
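To make the shared-structure idea concrete, here is a minimal sketch. The feature names and the handful of phonemes below are illustrative, not a complete or official inventory; the point is that several consonant pairs differ in exactly one feature, so a single [voiced] detector is relevant to all of them:

```python
# Illustrative ARPABET -> articulatory-feature map (a small, hypothetical
# subset; a real system would cover all 39 phonemes and ~30 features).
FEATURES = {
    "P": {"bilabial", "stop"},
    "B": {"bilabial", "stop", "voiced"},
    "F": {"labiodental", "fricative"},
    "V": {"labiodental", "fricative", "voiced"},
    "S": {"alveolar", "fricative"},
    "Z": {"alveolar", "fricative", "voiced"},
}

def contrast(a: str, b: str) -> set:
    """Features present in exactly one of the two phonemes."""
    return FEATURES[a] ^ FEATURES[b]

# Each pair below is distinguished by the single feature [voiced], so a
# voicing classifier trained on any one pair transfers to the others.
for pair in [("P", "B"), ("F", "V"), ("S", "Z")]:
    print(pair, contrast(*pair))
```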

#### **2. Handling Low-Frequency Phonemes**

Some phonemes occur infrequently in speech corpora. Looking at the actual distribution from the LibriBrain dataset ([Özdogan et al. 2025](https://arxiv.org/abs/2506.02098)), phoneme frequencies vary dramatically:

Direct classification might struggle with limited training data for rare phonemes.

The model can then recognise /ŋ/ as the intersection of [nasal] + [velar] + [voiced], even with limited direct /ŋ/ examples.
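As a toy illustration of this compositional lookup, a phoneme can be recovered as the member of the inventory matching a predicted feature set. The four-phoneme inventory below is a hypothetical fragment, not the full feature matrix:

```python
# Hypothetical fragment of a phoneme -> feature inventory (illustrative).
INVENTORY = {
    "N":  {"nasal", "alveolar", "voiced"},
    "M":  {"nasal", "bilabial", "voiced"},
    "NG": {"nasal", "velar", "voiced"},
    "G":  {"stop", "velar", "voiced"},
}

def lookup(features: set) -> list:
    """Phonemes whose feature sets exactly match the predicted features."""
    return [p for p, f in INVENTORY.items() if f == features]

# [nasal] + [velar] + [voiced] picks out /ng/ even if /ng/ itself was rare in
# training, because each individual feature was learned from many phonemes.
print(lookup({"nasal", "velar", "voiced"}))  # ['NG']
```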

#### **3. Biological Plausibility**

Neurophysiological evidence suggests the brain may encode speech in terms of articulatory features rather than whole phonemes (e.g. [Mesgarani et al. 2014](https://pmc.ncbi.nlm.nih.gov/articles/PMC4350233/)). MEG signals might naturally align with feature-based representations, potentially improving classification accuracy.

#### **4. Graceful Degradation**

When predictions are uncertain, feature-based models provide partial information. Instead of a completely wrong phoneme prediction, you might get the correct manner of articulation ([fricative]) even if the place of articulation ([alveolar] vs [postalveolar]) is incorrect.

### **Universal Phonetic Features: The IPA Foundation**

The International Phonetic Alphabet (IPA) provides a systematic framework for describing speech sounds based on how they are articulated.
*Figure: The IPA consonant chart shows all known and even possible consonants in human languages produced with air from the lungs (from [Wikipedia](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet#/media/File:IPA_chart_2020.svg), accessed 21 September 2025)*

#### **Consonant Features**

For the 24 consonants in English, we can define features based on three primary dimensions:
- **Voiced**: vocal cords vibrate during production
- **Voiceless**: no vocal cord vibration

#### **Vowel Features**

For the 10 vowels, the core IPA features include:
*Figure: Positions in the IPA vowel chart correspond to tongue position (from [blog](https://www.languagejones.com/blog-1/2016/12/24/why-the-international-phonetic-alphabet-ipa-is-the-best-thing-ever), accessed 21 September 2025)*

#### **Diphthong Features**

For diphthongs, ARPABET uses the following digraphs (AY, AW, EY, OY, OW). Diphthongs can be approximated by a sequence of two other IPA symbols for the English monophthong vowels, each with their own articulatory features:

These differ from English monophthongs like **IH** /ɪ/, which map to single IPA symbols.

Incidentally, some analyses of English vowels prefer features like **tenseness** (tense vs lax) to distinguish vowels like /i/, /o/, /u/ from /ɪ/, /ɔ/, /ʊ/. This is not an IPA feature but rather represents a language-specific categorisation that correlates in the IPA with the more precise height and backness distinctions above.

### **Complete IPA-Based Feature Set for ARPABET**

Here's a template binary feature matrix for the ARPABET phonemes:

This means that there are no separate entries for diphthongs, which solves the problem of assigning features to them.

Note that affricates like /dʒ/ could also be separated into simpler phonemes (/d/ and /ʒ/), but the same difficulty with assigning features to them does not arise.

### **Alternative Feature Sets**

While IPA-based features provide a solid articulatory foundation, they are not the only option. For example, prior neuroscience studies have used sets that mix articulatory features (as in the IPA) with other kinds of features.

#### **Mixed Feature Sets in Neuroscience**

The influential work by [Mesgarani et al. (2014)](https://pmc.ncbi.nlm.nih.gov/articles/PMC4350233/) used surgical cortical recordings from human superior temporal gyrus (STG) during natural speech listening to demonstrate that individual brain sites show selectivity to distinct phonetic features rather than whole phonemes. Their study used a mixed feature set combining the following:
One limitation of the Mesgarani study, however, was that it used an incomplete set of 14 phonetic features that do not distinguish all English phonemes. To be clear, 14 binary features could be enough to distinguish 39 phonemes, as we explain below. But the specific features chosen in the study do not end up separating all phonemes, so additional features would be needed for a complete classification system. As we note near the conclusion, though, there are probably benefits even to using partial feature sets.

### **The Mathematics of Feature Space**

How many binary features do we need to uniquely represent n phonemes? The theoretical minimum follows from information theory:

For 39 ARPABET phonemes: ⌈log₂(39)⌉ = ⌈5.29⌉ = 6 binary features

However, this assumes optimal encoding. Linguistically-motivated features typically require more dimensions for interpretability and biological plausibility.
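The bound is easy to verify with a one-line stdlib computation:

```python
import math

def min_binary_features(n_phonemes: int) -> int:
    """Theoretical minimum number of binary features needed to give each
    of n_phonemes a unique code: ceil(log2(n))."""
    return math.ceil(math.log2(n_phonemes))

print(min_binary_features(39))  # 6
```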

#### **Why Binary Features?**

By convention, linguists often encode phonetic properties as binary features (present=1, absent=0). Features like [±voiced] or [±nasal] naturally divide phonemes into two groups. From a machine learning perspective, binary features also allow us to perform binary classification, which has numerous benefits including simplifying the hypothesis space for the model, making training more efficient and inference more robust, and avoiding the need to impose arbitrary ordinal relationships between categories.
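A minimal sketch of this recasting: turn a single phoneme label into one 0/1 target per feature, so each feature column can be fit by its own binary classifier. The feature names and mini-inventory here are illustrative, not the official competition feature set:

```python
# Order of binary outputs; a real model would have ~30 of these.
FEATURE_NAMES = ["voiced", "nasal", "stop", "fricative", "bilabial", "velar"]

# Illustrative fragment of the phoneme -> feature map.
FEATURES = {
    "P":  {"stop", "bilabial"},
    "B":  {"stop", "bilabial", "voiced"},
    "M":  {"nasal", "bilabial", "voiced"},
    "NG": {"nasal", "velar", "voiced"},
}

def to_multilabel(phoneme: str) -> list:
    """Binary target vector for one phoneme: one 0/1 entry per feature.
    Each column can then be fit by an independent binary classifier."""
    return [int(name in FEATURES[phoneme]) for name in FEATURE_NAMES]

print(to_multilabel("B"))   # [1, 0, 1, 0, 1, 0]
print(to_multilabel("NG"))  # [1, 1, 0, 0, 0, 1]
```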

With the phoneme **IH** /ɪ/, we can attribute the following features:

The rest of the vowel phoneme features can be assigned in a similar way. The full feature attribution can be seen above in the full IPA-based binary feature matrix.

#### **Diphthong Challenge**

Diphthongs like /aɪ/ move between two vowel targets, making single feature assignment problematic. We present here several ways to model diphthongs with their pros and cons:

We can use higher-level (more abstract) phonological distinctions.

**Cons**: Requires domain knowledge, may miss low-level articulatory details

### **Brute Force Feature Discovery**

Since the "correct" feature set for neural representation is unknown, we can systematically search the feature space:

#### **Approach 1: Exhaustive Binary Search**

For k binary features representing n phonemes, there are 2^(n×k) possible feature assignments in total. Under the constraint that each phoneme must have a unique feature vector (i.e. considering only assignments where the n assigned feature vectors are all distinct), we can train a classifier to map MEG data to the assigned binary features and measure how accurately it predicts them from the neural input. The assignment with the highest model accuracy would be the best set to use. Although this approach is tractable for small k, the search space grows exponentially, making it computationally infeasible for larger k.
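The enumeration can be sketched for toy sizes. Requiring distinct vectors makes each assignment an injection from phonemes into the 2^k codes, so there are P(2^k, n) valid assignments out of 2^(n×k) total:

```python
from itertools import permutations, product

def distinct_assignments(n: int, k: int):
    """All ways to give n phonemes distinct k-bit feature vectors.
    Brute force: only feasible for tiny n and k."""
    vectors = list(product([0, 1], repeat=k))  # the 2**k possible codes
    return permutations(vectors, n)            # injective assignments only

n, k = 3, 2
total = 2 ** (n * k)                                # all assignments, 2**6
valid = sum(1 for _ in distinct_assignments(n, k))  # distinct only, 4*3*2
print(total, valid)
# In the real setting, each candidate assignment would be scored by training
# a classifier on MEG data and keeping the assignment with the best accuracy.
```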

#### **Approach 2: Evolutionary Search**

A similar but less exhaustive approach involves modifying the best-performing feature sets and evaluating model performance on the mutated feature sets. The following steps can be taken for the evolutionary search approach:
4. Apply mutation/crossover to generate new candidates
5. Repeat until convergence

### **General Implementation Strategy**

Here, we provide some general implementation tips which could help with the development of robust and accurate models for phoneme classification.
5. **Search Strategically**: Use evolutionary or gradient methods to discover novel feature combinations
6. **Validate Interpretability**: Ensure discovered features have linguistic or neurobiological interpretation

### **Practical Implementation for the Competition**

The competition task requires mapping from brain data to probability distributions over all 39 ARPABET phonemes. However, you can implement feature-based classification internally while still meeting this requirement through conversion:

#### **Feature-to-Phoneme Conversion Pipeline**

1. **Train feature classifiers**: Build separate binary classifiers for each phonetic feature (e.g. [voiced], [fricative], [front])
2. **Predict feature probabilities**: For each input, obtain probability estimates for all features
- **Learned mapping**: Train a secondary classifier to map from feature space to phoneme space
- **Probabilistic matching**: Use Bayesian inference to compute P(phoneme|features)
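A minimal sketch of the probabilistic matching option, assuming the feature predictions are independent (a naive-Bayes-style assumption); the probabilities and the four-phoneme inventory are made up for illustration:

```python
# Hypothetical per-feature probabilities from the binary classifiers.
feature_probs = {"voiced": 0.9, "nasal": 0.8, "velar": 0.7, "stop": 0.2}

# Illustrative fragment of the phoneme -> feature inventory.
INVENTORY = {
    "NG": {"voiced", "nasal", "velar"},
    "N":  {"voiced", "nasal"},
    "G":  {"voiced", "velar", "stop"},
    "K":  {"velar", "stop"},
}

def phoneme_posterior(probs, inventory):
    """P(phoneme | features) under a naive independence assumption:
    multiply P(f) for features the phoneme has and 1 - P(f) for those it
    lacks, then normalise over the inventory."""
    scores = {}
    for phoneme, feats in inventory.items():
        score = 1.0
        for feat, p in probs.items():
            score *= p if feat in feats else (1.0 - p)
        scores[phoneme] = score
    total = sum(scores.values())
    return {ph: s / total for ph, s in scores.items()}

posterior = phoneme_posterior(feature_probs, INVENTORY)
print(max(posterior, key=posterior.get))  # NG
```

In a full system the inventory would cover all 39 ARPABET phonemes, yielding the required probability distribution over phonemes even though every trained classifier is binary.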
Since conversion back to phonemes is always possible, you don't need a complete feature set:
1. **Start with consonants only**: Implement features for the 24 consonants (manner, place, voicing), use direct classification for vowels/diphthongs
2. **Add vowel subsets**: Gradually incorporate vowel features (height, backness, rounding) as you refine the approach
3. **Handle diphthongs last**: These are the most complex; initially treat them as single units or use simplified approximations

#### **Potential Benefits of Partial Feature Implementation**

- **Reduced complexity**: Focus on phoneme classes where features are most clear-cut (consonants)
- **Faster iteration**: Test feature-based approaches without solving all edge cases upfront

Landau, G., Özdogan, M., Elvers, G., Mantegna, F., Somaiya, P., Jayalath, D., K
Özdogan, M., Landau, G., Elvers, G., Jayalath, D., Somaiya, P., Mantegna, F., Woolrich, M., & Parker Jones, O. (2025). LibriBrain: Over 50 Hours of Within-Subject MEG to Improve Speech Decoding Methods at Scale. NeurIPS, Datasets & Benchmarks Track. [https://arxiv.org/abs/2506.02098](https://arxiv.org/abs/2506.02098)