In the last post about my vocal synthesis project, I talked about implementing the Wide-Band Voice Pulse Modeling algorithm. Since then, I've actually done some original research of my own and have devised what I believe to be three minor improvements to the algorithm.
I implemented the Wide-Band Voice Pulse Modeling (WBVPM) algorithm (from Dr. Jordi Bonada's PhD thesis: https://www.tdx.cat/bitstream/handle/10803/7555/tjbs.pdf) via the upsampling method (specifically, upsampling via a natural cubic spline). The thesis proposes two methods, the other being periodization. There is a patent that pertains to WBVPM, but it only covers the periodization version (which is what they used for their results), so I have implemented the upsampling method instead. I have been able to validate the main results of the thesis; specifically, its shape invariance and its lower residual compared to other methods. Furthermore, I have devised three significant improvements to the algorithm, two of which are only possible because I used the spline approach, so in a sense it was good that I had to do it that way.
Of the three improvements, I have implemented the first two and shown their advantage over the original WBVPM algorithm. The score is the mean of the relative residual level (i.e. the level of the difference between the original and reconstructed signals, relative to the level of the original signal). I measured it on an audio sample that deliberately exhibits traits noted as degrading the quality of the WBVPM algorithm's output: a low-pitched voice with rapid and deep vibrato, transients, strong amplitude modulation, and a large portion of the sample spent in a voiced/unvoiced/voiced transition.
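To make the score concrete, here is a minimal sketch of one way such a metric could be computed. The framing (1024-sample frames, 512-sample hop), the use of RMS as the "level", and the dB scale are all assumptions for illustration; the function name `mean_relative_residual_db` is hypothetical and this is not necessarily the exact measure used for the results in this post.

```python
import numpy as np

def mean_relative_residual_db(original, reconstructed, frame=1024, hop=512, eps=1e-12):
    """Mean over frames of (residual RMS / original RMS), expressed in dB.

    Assumed definition: frame size, hop, RMS and dB are illustrative choices.
    """
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    n = min(len(original), len(reconstructed))
    if n < frame:
        raise ValueError("signals are shorter than one analysis frame")
    residual = original[:n] - reconstructed[:n]

    levels = []
    for start in range(0, n - frame + 1, hop):
        rms_orig = np.sqrt(np.mean(original[start:start + frame] ** 2))
        rms_res = np.sqrt(np.mean(residual[start:start + frame] ** 2))
        # eps guards against log(0) and division by zero in silent frames
        levels.append(20.0 * np.log10((rms_res + eps) / (rms_orig + eps)))
    return float(np.mean(levels))
```

Lower (more negative) values mean the reconstruction is closer to the original. Averaging per-frame levels, rather than computing one global ratio, keeps loud passages from dominating the score.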
First I should note that my WBVPM implementation is currently far from optimal. The pitch estimation system (via the modified TWM algorithm) has not undergone testing and tuning of its parameters, and there are many variations of the TWM algorithm to consider. Additionally, I have not implemented unvoiced/voiced detection (because, as far as I can tell, it is not mentioned in Bonada's thesis; presumably it's in prior literature, but I have not researched it yet), so all the algorithms act as if they are always processing a voiced signal even when they are not.
RESILIENT BORDER INTERPOLATION IN SYNTHESIS - When I first implemented the synthesis step for WBVPM, it was late at night and I was tired. I wanted a quick result before I went to bed and didn't understand the wording of the description of the synthesis step in WBVPM. As such, my original implementation differed significantly. Instead of using overlap-and-add, it found, for each output sample, the closest voice pulse and evaluated that pulse at that time. It took advantage of the spline that was generated for downsampling, and used the periodic nature of the pulse to extend it when the sample fell beyond its domain (i.e. the opposite of overlapping). This approach led to high-frequency crackling artifacts due to discontinuities at the voice pulse boundaries.
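A rough sketch of that first "nearest pulse" approach, under several assumptions: each pulse is a callable that maps a position in samples (within one period) to a value, standing in for the spline; "closest" is measured to the centre of each pulse; and the name `synth_nearest_pulse` is hypothetical.

```python
import numpy as np

def synth_nearest_pulse(onsets, periods, pulses, n_samples):
    """For every output sample, evaluate the nearest voice pulse, extended periodically."""
    onsets = np.asarray(onsets, dtype=float)
    periods = np.asarray(periods, dtype=float)
    t = np.arange(n_samples, dtype=float)

    # Assumption: the "closest" pulse is the one whose centre is nearest in time.
    centres = onsets + 0.5 * periods
    nearest = np.argmin(np.abs(centres[:, None] - t[None, :]), axis=0)

    out = np.zeros(n_samples)
    for k, pulse in enumerate(pulses):
        mask = nearest == k
        # The modulo wraps positions outside the pulse back into one period.
        out[mask] = pulse((t[mask] - onsets[k]) % periods[k])
    return out
```

Because each sample comes from exactly one pulse, nothing is ever summed, so there is no amplitude modulation; the cost is a hard switch from one pulse to the next, which is where the crackling comes from when neighbouring pulses disagree.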
The following day, I properly understood the synthesis approach and rewrote the synthesis code. Interestingly, this actually gave worse overall results. While the high-frequency artifacts were gone, there were now large low-frequency artifacts that appeared as large modulations in the time domain. I eventually tracked this down to a bug in my implementation of the MFPA algorithm that sometimes resulted in massive errors of up to 1.5 radians. I fixed this bug and the reconstruction no longer had significant artifacts, but I thought it was interesting that my approach, despite having the discontinuity issue, was more resilient to errors in the MFPA estimation. I began wondering whether the two approaches could be combined to create an even better one.
I then considered why the modulation occurred with the overlap-and-add method. When the fundamental frequency is stationary and the MFPA onsets are perfect, the trapezoidal window function is equivalent to a weighted average between two adjacent voice pulses over a duration of twice the border interpolation size. However, when the MFPA onsets are inaccurate, or even just when the fundamental frequency is non-stationary, this is no longer true. Even worse, seen as a weighted average, the weights no longer necessarily sum to one everywhere, hence the modulation.
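This is easy to check numerically with a simplified model. In the sketch below each pulse gets a trapezoidal window that ramps up over `2 * border` samples centred on its onset and ramps down over `2 * border` samples centred on its onset plus its own period. That window shape, and the names `trapezoid` and `window_sum`, are assumptions for illustration rather than the exact window from the thesis.

```python
import numpy as np

def trapezoid(t, onset, length, border):
    """Trapezoidal window: rises across [onset - border, onset + border],
    falls across [onset + length - border, onset + length + border]."""
    rise = np.clip((t - (onset - border)) / (2.0 * border), 0.0, 1.0)
    fall = np.clip(((onset + length + border) - t) / (2.0 * border), 0.0, 1.0)
    return np.minimum(rise, fall)

def window_sum(t, onsets, lengths, border):
    """Sum of all pulse windows at times t (the overlap-and-add gain)."""
    total = np.zeros_like(t, dtype=float)
    for onset, length in zip(onsets, lengths):
        total += trapezoid(t, onset, length, border)
    return total

# Stationary case: onsets exactly one period apart, so the gain is flat.
t = np.arange(150, 850, dtype=float)
onsets = 100.0 * np.arange(10)
lengths = np.full(10, 100.0)
flat = window_sum(t, onsets, lengths, border=10.0)

# One onset estimate off by 8 samples: the ramps no longer line up.
bad_onsets = onsets.copy()
bad_onsets[5] += 8.0
wobbly = window_sum(t, bad_onsets, lengths, border=10.0)
```

In the stationary case every falling ramp coincides with the next rising ramp, so `flat` stays at one. With a single misplaced onset, `wobbly` dips below one on one side of the pulse and rises above one on the other, which is the amplitude modulation described above.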
I then devised a method that would not result in modulation. This method works by first synthesizing the 'inner' portion of each pulse (by 'inner', I mean starting at the end of the border interpolation at the start of the pulse, and ending at the start of the next border interpolation towards the end of the pulse). Then, for the gaps between pulses, we calculate each sample value as a weighted average of two values: the values of the two neighbouring voice pulses at that time. Since the gap extends beyond the boundaries of each voice pulse, we use the periodic nature of the pulses to compute the effective position in each voice pulse, taking the position modulo the period of the fundamental frequency at that voice pulse. The fundamental frequencies of the two voice pulses may differ, so the step size through each pulse changes linearly across the gap. At the end of the gap, the latter voice pulse (the one the gap borders there) has a step of one sample in time, while the step for the former voice pulse is the equivalent of one sample in the latter voice pulse, scaled by the ratio of their fundamental frequencies (e.g. if the second voice pulse has twice the fundamental frequency of the first, then at the end of the gap the step size for the first would be 2 and the step size for the second would be 1). At the start of the gap, it is the same except that the first pulse has the step of 1. In between, we interpolate the step sizes linearly.
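The gap-filling step can be sketched as follows. Several details here are assumptions that the description above does not pin down: the crossfade weights are linear, the former pulse's position is accumulated forwards from the last sample of its inner portion (`pos_a`), the latter pulse's position is accumulated backwards from the first sample of its inner portion (`pos_b`), and each pulse is a callable over positions in samples standing in for the spline. The names `gap_steps` and `fill_gap` are hypothetical.

```python
import numpy as np

def gap_steps(period_a, period_b, n):
    """Step sizes through pulse A (former) and pulse B (latter) for the
    n + 1 steps that lead from A's inner portion, across n gap samples,
    into B's inner portion."""
    ratio = period_a / period_b            # equals f0_b / f0_a
    s = np.linspace(0.0, 1.0, n + 1)       # 0 at start of gap, 1 at end
    step_a = 1.0 + s * (ratio - 1.0)       # 1 at the start, ratio at the end
    step_b = 1.0 / ratio + s * (1.0 - 1.0 / ratio)  # 1/ratio at start, 1 at end
    return step_a, step_b

def fill_gap(pulse_a, period_a, pos_a, pulse_b, period_b, pos_b, n):
    """Return the n gap samples as a weighted average of both pulses."""
    if n <= 0:
        return np.zeros(0)
    step_a, step_b = gap_steps(period_a, period_b, n)
    cum_a = np.cumsum(step_a)
    cum_b = np.cumsum(step_b)

    # A walks forwards from pos_a; B walks backwards from pos_b.
    # The modulo applies the periodic extension of each pulse.
    where_a = (pos_a + cum_a[:n]) % period_a
    where_b = (pos_b - (cum_b[-1] - cum_b[:n])) % period_b

    # Assumed linear crossfade; the two weights always sum to one.
    w_b = np.arange(1, n + 1) / (n + 1.0)
    return (1.0 - w_b) * pulse_a(where_a) + w_b * pulse_b(where_b)
```

Because the two weights sum to one at every sample regardless of where the onsets land, an onset error shows up as a slightly wrong waveform shape inside the gap rather than as a change in overall gain.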