Hello, I'm back with another update to my VOCALOID project. It's not as big an improvement as last time - in fact, there are no new features - but I felt it was worth posting. I've been trying to fix the major issues before I move on to implementing the Excitation plus Resonance model.
The first thing I attempted to tackle was all the added noise at high frequencies.
Here's the original spectrum: https://files.catbox.moe/fq55bo.png
And here's the reconstructed spectrum (with no transforms applied): https://files.catbox.moe/gq7jff.png
You can clearly see the high frequency artifacts. The first thing I tried was something mentioned in the paper. The WBVPM section describes two approaches for computing a discrete Fourier transform of non-integer size: the first is repeating the signal, and the second is upsampling it. I went with the second, because the first is patented and the second is easier to implement. The paper mentions that increasing the repetition count of the signal (or, in the case of upsampling, the upsampling factor) and then discarding the higher frequencies can improve the estimation by reducing artifacts. For repetition, it also mentions that quadratic interpolation can be applied to the resulting spectrum. However, I am not sure whether this can be done for upsampling, so I have not tried to implement it for now.
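To make the upsampling idea concrete, here is a minimal sketch of how I read it, not the paper's exact procedure. It resamples one pulse of non-integer length onto an integer number of points, takes a plain DFT, and keeps only the lowest bins. The names `interp_at` and `pulse_spectrum` are made up for this post, and linear interpolation is an assumption chosen for brevity.

```python
import cmath
import math

def interp_at(signal, t):
    """Linearly interpolate `signal` at fractional index t (t must be in range)."""
    i = math.floor(t)
    frac = t - i
    return signal[i] * (1.0 - frac) + signal[i + 1] * frac

def pulse_spectrum(signal, start, period, factor, n_keep):
    """DFT of one pulse of non-integer length `period`, computed via upsampling.

    `factor` is the upsampling factor; only the `n_keep` lowest bins are
    returned, the higher ones are discarded.
    """
    n = math.ceil(period * factor)   # integer pulse size after upsampling
    step = period / n                # spacing of the new grid, in original samples
    up = [interp_at(signal, start + k * step) for k in range(n)]
    bins = []
    for h in range(n_keep):
        acc = sum(x * cmath.exp(-2j * math.pi * h * k / n)
                  for k, x in enumerate(up))
        bins.append(acc / n)         # normalise so a unit cosine gives 0.5
    return bins

# A cosine whose period is 10.5 samples, analysed with an upsampling factor of 3.
period = 10.5
signal = [math.cos(2 * math.pi * i / period) for i in range(40)]
bins = pulse_spectrum(signal, 0.0, period, 3, 5)
```

In this toy case almost all of the energy lands in bin 1, which is what you would hope for from a single harmonic.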
Here's the result after applying an upsampling factor of 3: https://files.catbox.moe/qcgnzq.png
Here's the original audio: https://files.catbox.moe/f7g8ta.wav
The original reconstruction: https://files.catbox.moe/da0m1i.wav
And now with the improved reconstruction: https://files.catbox.moe/513ycn.wav
You can see an improvement, especially at lower frequencies, but the high frequency artifacts largely persist, so they have to be arising elsewhere. I realized the source was the reconstruction of the signal (AKA the "synthesis"). I had previously implemented a synthesis method that was quite different from the one used in the paper, because I did not understand the paper's method at first. My method worked like this: for each voice pulse, take every output sample for which that pulse is the closest one, and set the sample to the interpolated value of a spline representing the time domain version of the upsampled voice pulse, with a step corresponding to the ratio between a sample in the regular time domain and a sample in the upsampled time domain. Now, in some cases, estimation inaccuracies and differences from any transformations that were applied result in these regions of samples being bigger than the actual pulse itself. In these cases, we take advantage of the periodic nature of the voice pulse and repeat it (i.e. sampling before the start is equivalent to sampling at that offset from the end, and sampling after the end is equivalent to sampling at that offset from the start). However, this method results in discontinuities in some cases.
Here is an example of such a discontinuity: https://files.catbox.moe/jnnxfj.png
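For reference, the nearest-pulse synthesis described above can be sketched roughly like this. It is a simplified illustration and not my actual code: it uses linear interpolation where I used a spline, and the names `sample_periodic` and `nearest_pulse_synth` as well as the `(start, length, waveform)` pulse layout are invented for the example.

```python
def sample_periodic(waveform, pos):
    """Sample `waveform` at fractional position `pos`, wrapping around
    periodically when `pos` falls before the start or after the end."""
    n = len(waveform)
    pos = pos % n                    # Python's % is non-negative for n > 0
    i = int(pos)
    frac = pos - i
    return waveform[i] * (1.0 - frac) + waveform[(i + 1) % n] * frac

def nearest_pulse_synth(pulses, n_out):
    """Each pulse is (start, length, waveform): `start` and `length` are in
    output samples (possibly fractional), `waveform` is the upsampled pulse."""
    out = []
    for n in range(n_out):
        # Pick the pulse whose centre is closest to this output sample.
        start, length, wave = min(
            pulses, key=lambda p: abs(n - (p[0] + p[1] / 2.0)))
        step = len(wave) / length    # upsampled samples per output sample
        out.append(sample_periodic(wave, (n - start) * step))
    return out

# Two pulses with a two-sample gap between them (samples 4 and 5).
ramp = [0.0, 1.0, 2.0, 3.0]
out = nearest_pulse_synth([(0.0, 4.0, ramp), (6.0, 4.0, ramp)], 10)
```

The gap gets filled by wrapping the first pulse around, and the output then switches abruptly to the second pulse at sample 6. Nothing forces the two pulses to agree at that switch point, which is where the discontinuities come from.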
I began trying to implement an interpolation system. In this system, we would calculate the gap between pulses - or, in the case of inaccuracies in the other direction (i.e. overlapping pulses), the overlapping area - and interpolate linearly between one pulse and the other. However, this approach was complicated significantly by the non-integer (and potentially differing) sizes of the pulses, as well as numerous edge cases. I spent over an hour trying to figure out how to do it correctly. About halfway through, I decided to check the paper again, and this time I understood the actual synthesis method properly, largely because of a diagram I had missed the first time.
In the actual method, each pulse is expanded in a manner similar to the border interpolation technique used in WBVPM analysis, except kind of in reverse. For each voice pulse, we generate an extension on each side, with each extension having a size equal to the border interpolation ratio times the size of the voice pulse. Then we apply a trapezoidal window to the extended voice pulse: it starts at zero at each edge of the extended pulse and reaches 1 after a ramp of twice the border interpolation size on each side. Finally, we overlap and add the voice pulses.
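Here is a rough sketch of that windowed overlap-add, under heavy simplifying assumptions: all pulses have the same integer length `n`, the border size `b` is an integer with `n >= 2 * b`, and the extensions are filled by repeating the pulse periodically (I am assuming that last part for the sketch). The function names are made up, and the ramp is sampled at half-sample offsets so that neighbouring ramps sum to exactly 1, which means the first value is slightly above zero.

```python
def trapezoid_window(n, b):
    """Window for a pulse of length n extended by b samples on each side:
    a ramp of 2*b samples at each end, flat at 1 in between."""
    total = n + 2 * b
    ramp = 2 * b
    win = [1.0] * total
    for j in range(ramp):
        w = (j + 0.5) / ramp         # half-sample offset keeps it symmetric
        win[j] = w
        win[total - 1 - j] = w
    return win

def extend_periodic(pulse, b):
    """Extend a pulse by b samples on each side by wrapping it around."""
    n = len(pulse)
    return [pulse[(k - b) % n] for k in range(n + 2 * b)]

def overlap_add(pulses, b):
    """Window each extended pulse and add it at a hop of one pulse length.
    Output index 0 corresponds to b samples before the first pulse."""
    n = len(pulses[0])
    win = trapezoid_window(n, b)
    out = [0.0] * (n * len(pulses) + 2 * b)
    for p, pulse in enumerate(pulses):
        for k, x in enumerate(extend_periodic(pulse, b)):
            out[p * n + k] += x * win[k]
    return out

# Three constant pulses: away from the outer edges the windows sum back to 1.
out = overlap_add([[1.0] * 8] * 3, 2)
```

Because each falling ramp lines up with the next pulse's rising ramp, the pulses crossfade into each other around every boundary instead of switching abruptly.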