# Robust, Precise and Real-Time Algorithm for Speech Signal Pitch Detection and Extraction

Abstract: A two-step algorithm is proposed for speech pitch detection and fundamental frequency extraction. The algorithm first obtains a coarse estimate of the pitch from frequency-domain analysis and then refines it into an accurate estimate by time-domain analysis. It meets the goals of robustness, accuracy, and real-time operation. Experimental results show that it outperforms Praat and Adobe Audition in pitch detection and fundamental frequency extraction for Chinese speech.

1. Introduction

2. Algorithm Design and Implementation

$X\left[j-1\right]$, $X\left[j+1\right]$

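a) Apply a fast Fourier transform to the speech frame F to obtain the spectrum sequence P = FFT(F). With a sampling rate of 16 kHz and an FFT length of 512, the frequency resolution of the FFT is 16000/512 = 31.25 Hz.

b) Extract two peak points, PV1 and PV2, and one maximum point, PK, from the spectrum sequence P. PV1 is the peak between 93 and 218 Hz, PV2 is the peak between 218 and 375 Hz, and PK is the maximum between 93 and 375 Hz. Note that PV1 and PV2 may not exist, but PK always exists.

c) Coarse estimation of the fundamental frequency F0. If PV1 exists, F0 is estimated from PV1; otherwise, if PV2 exists, F0 is estimated from PV2; if neither PV1 nor PV2 exists, F0 is estimated from PK. The result is the coarse estimate CF0.

d) Adjustment of the coarse estimate CF0. When CF0 was derived from the peak PV1, two cases are handled. In the first case, PV2 coincides with PK and PV1 is not the half frequency of PV2; F0 is then estimated from PV2. In the second case, PV2 exists and its peak value is larger than that of PV1; F0 is then also estimated from PV2.

e) The coarse estimate CF0 is refined in the time domain. First, a correspondence is established between frequency in Hz and the index into the time-domain speech sequence S. This relationship is nonlinear. For ease of implementation it is approximated by a piecewise linear function with three segments, 100 to 200 Hz, 200 to 300 Hz, and 300 to 400 Hz. With X the frequency in Hz and Y the corresponding index in S, the three segments are, in that order: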

$Y=-0.80X+240$ (2.1)

$Y=-0.27X+134$ (2.2)

$Y=-0.13X+92$ (2.3)

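f) From the coarse estimate CF0 and the piecewise relations above, obtain the index SI in the speech sequence S that corresponds to CF0. Search S for a peak SV1 near index SI and for a peak SV2 near index 2 * SI. If both SV1 and SV2 exist, compute the precise estimate F0 of CF0 from the indices of SV1 and SV2.

g) If the coarse estimate CF0 and the precise estimate F0 differ by 15% or more, the coarse estimate CF0 replaces the precise estimate F0.

h) Return the precise estimate F0 as the fundamental frequency of the current speech frame.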
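Steps a) to h) can be sketched in Python with NumPy as follows. This is a minimal illustration, not the authors' implementation, and several details that the text leaves open are filled in as assumptions: a peak is taken to be a strict local maximum, the half-frequency test in step d) uses a tolerance of one FFT bin, the time-domain search radius is SI/2, the precise estimate is the sampling rate divided by the distance between SV1 and SV2, the 15% difference is measured relative to CF0, and CF0 is returned when SV1 or SV2 is missing. The function names are hypothetical.

```python
import numpy as np

FS = 16000   # sampling rate from the text (Hz)
NFFT = 512   # FFT length from the text; bin spacing FS / NFFT = 31.25 Hz


def spectral_peak(mag, lo_hz, hi_hz):
    """Bin of the largest strict local maximum within [lo_hz, hi_hz], or None."""
    lo = int(np.ceil(lo_hz * NFFT / FS))
    hi = int(np.floor(hi_hz * NFFT / FS))
    peaks = [j for j in range(lo, hi + 1)
             if mag[j] > mag[j - 1] and mag[j] > mag[j + 1]]
    return max(peaks, key=lambda j: mag[j]) if peaks else None


def coarse_f0(frame):
    """Steps a) to d): coarse estimate CF0 from the magnitude spectrum."""
    mag = np.abs(np.fft.rfft(frame, NFFT))
    pv1 = spectral_peak(mag, 93, 218)
    pv2 = spectral_peak(mag, 218, 375)
    lo, hi = 3, 12  # bins covering 93 to 375 Hz at 31.25 Hz per bin
    pk = lo + int(np.argmax(mag[lo:hi + 1]))
    if pv1 is None:
        chosen = pv2 if pv2 is not None else pk
    else:
        chosen = pv1
        if pv2 is not None:
            # Assumption: "half frequency" is tested with a one-bin tolerance.
            is_half = abs(2 * pv1 - pv2) <= 1
            if (pv2 == pk and not is_half) or mag[pv2] > mag[pv1]:
                chosen = pv2
    return chosen * FS / NFFT


def lag_from_hz(f):
    """Equations (2.1) to (2.3): frequency in Hz to index in S."""
    if f < 200:
        return -0.80 * f + 240
    if f < 300:
        return -0.27 * f + 134
    return -0.13 * f + 92


def time_peak(s, center, radius):
    """Index of the largest sample near center if it is a local peak, else None."""
    lo = max(center - radius, 1)
    hi = min(center + radius, len(s) - 2)
    if lo > hi:
        return None
    i = lo + int(np.argmax(s[lo:hi + 1]))
    return i if s[i] > s[i - 1] and s[i] > s[i + 1] else None


def estimate_f0(frame):
    """Steps e) to h): refine CF0 in the time domain."""
    frame = np.asarray(frame, dtype=float)
    cf0 = coarse_f0(frame)
    si = int(round(lag_from_hz(cf0)))
    radius = si // 2  # assumption: the text only says "near" SI and 2 * SI
    sv1 = time_peak(frame, si, radius)
    sv2 = time_peak(frame, 2 * si, radius)
    if sv1 is None or sv2 is None or sv2 <= sv1:
        return cf0  # assumption: fall back to CF0 when a peak is missing
    f0 = FS / (sv2 - sv1)
    if abs(f0 - cf0) >= 0.15 * cf0:  # step g), relative to CF0 by assumption
        return cf0
    return f0


# Usage on a synthetic 160 Hz harmonic frame (a test signal, not source data).
n = np.arange(NFFT)
frame = sum(np.cos(2 * np.pi * k * 160 * n / FS) / k for k in range(1, 6))
print(coarse_f0(frame), estimate_f0(frame))
```

On this synthetic frame the coarse estimate snaps to the nearest FFT bin, 156.25 Hz, and the time-domain step recovers 160 Hz from the 100-sample spacing of the waveform peaks. This is the point of the second step: its resolution is one sample of pitch period, not 31.25 Hz.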

3. Experimental Results

Figure 1. The waveform and the pitch (blue curve, Praat)

Figure 2. The waveform at SNR = 0 dB and the pitch (blue curve)

Figure 3. The waveform at SNR = 10 dB and the pitch (blue curve)

Figure 4. The waveform at SNR = 20 dB and the pitch (blue curve)

Figure 5. The waveform and the pitch (blue curve, Audition)

Figure 6. The waveform at SNR = 0 dB and the pitch (blue curve)

Figure 7. The waveform at SNR = 10 dB and the pitch (blue curve)

Figure 8. The waveform at SNR = 20 dB and the pitch (blue curve)

Figure 9. The waveform and the pitch (black curve, TFP)

Figure 10. The waveform at SNR = 0 dB and the pitch (black curve)

Figure 11. The waveform at SNR = 10 dB and the pitch (black curve)

Figure 12. The waveform at SNR = 20 dB and the pitch (black curve)

S1: 242, 224, 228, 232, 242

S2: 280, 280, 262, 266, 258, 250

S3: 250, 212, 188, 182

S4: 250, 242, 232, 226, 180, 166, 150, 156, 160, 164

Pitch values labeled by Praat (Hz):

S1: 231, 226, 230, 235, 243

S2: 278, 282, 270, 262, 256, 256

S3: 255, 201, 184, 178

S4: 252, 241, 239, 222, 175, 160, 151, 150, 155, 165

S1: x, x, x, x, x (x denotes lost, unavailable data)

S2: 264, 266, 258, 256, 260, 264

S3: x, x, x, x

S4: x, x, x, x, 162, 168, 168, 168, 182, 198

Pitch values labeled by TFP (Hz):

S1: 242, 219, 229, 230, 242

S2: 281, 281, 256, 258, 262, 250

S3: 250, 219, 188, 186

S4: 250, 242, 246, 239, 188, 165, 148, 156, 163, 163

Table 1. The pitch errors (Hz) by Praat and TFP
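The kind of per-method error that Table 1 refers to can be recomputed from the values listed above. The sketch below is only an illustration. It assumes that the first, unlabeled group of S1 to S4 values is the reference annotation, which the surviving text does not state, and it uses the mean absolute difference as the error measure, which may differ from the measure used in Table 1. The function name `mean_abs_error` is hypothetical.

```python
# Values copied from the listings above (Hz).
# Assumption: the first, unlabeled group is the reference annotation.
reference = {
    "S1": [242, 224, 228, 232, 242],
    "S2": [280, 280, 262, 266, 258, 250],
    "S3": [250, 212, 188, 182],
    "S4": [250, 242, 232, 226, 180, 166, 150, 156, 160, 164],
}
praat = {
    "S1": [231, 226, 230, 235, 243],
    "S2": [278, 282, 270, 262, 256, 256],
    "S3": [255, 201, 184, 178],
    "S4": [252, 241, 239, 222, 175, 160, 151, 150, 155, 165],
}
tfp = {
    "S1": [242, 219, 229, 230, 242],
    "S2": [281, 281, 256, 258, 262, 250],
    "S3": [250, 219, 188, 186],
    "S4": [250, 242, 246, 239, 188, 165, 148, 156, 163, 163],
}


def mean_abs_error(estimate, ref):
    """Mean absolute difference in Hz over all syllables S1 to S4."""
    diffs = [abs(e - r)
             for key in ref
             for e, r in zip(estimate[key], ref[key])]
    return sum(diffs) / len(diffs)


for name, values in (("Praat", praat), ("TFP", tfp)):
    print(name, mean_abs_error(values, reference))
```

The third group of values is left out because most of its entries are marked as lost.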

4. Conclusion

