Improve text normalize to keep original timestamps #264
fondoger wants to merge 5 commits into remsky:master
@remsky, @fireblade2534 Please review this PR. I tested it locally and the results look good.
I can't test it out right now, but I'll test it tomorrow.
… phonemizer that the rest of the text uses.
This PR looks great in concept, but there are a few problematic inputs that I want to highlight:
- Running on localhost:7860 -> Running on [localhost:[7860](/sˈɛvənti ˈeɪt sˈɪksti/)](/lˈoʊkɐlhˌoʊst kˈoʊlən sˈɛvən θˈaʊzənd ˈeɪt hˈʌndɹɪd sˈɪksti/)
- Email me at user@example.com -> Email me at [user@[example-com](/ɛɡzˈæmpəl dˈɑːt kˈɑːm/)](/jˈuːzɚɹ æɾ ɛɡzˈæmpəl dˈɑːt kˈɑːm/)
- Oh yeah I have $500.60 in my bank account -> Oh ye'a I have [$[500.60](/fˈaɪv hˈʌndɹɪd pˈɔɪnt sˈɪks zˈiəɹoʊ/)](/fˈaɪv hˈʌndɹɪd ænd wˈʌn dˈɑːlɚz ænd sˈɪksti sˈɛnts/) in my bank account
What happens in these cases (and will happen in more) is that one normalizer handles, for example, localhost:7860, but since the original text is still present inside the brackets as [localhost:7860], the number normalizer then comes along and normalizes the number a second time. This is an inherent issue with the way the normalizer and your code work together. The code does handle custom phonemes; see text_processor.py:handle_custom_phonemes and get_sentence_info.
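A minimal sketch of how the nesting arises. The two regex passes and the placeholder pronunciations below are hypothetical stand-ins, not the project's actual normalizers or real phonemes:

```python
import re

def wrap(original: str, spoken: str) -> str:
    """Keep the original text and attach a pronunciation: [original](/spoken/)."""
    return f"[{original}](/{spoken}/)"

# Hypothetical patterns for illustration only.
URL_RE = re.compile(r"\b[a-z]+:\d+\b")  # e.g. localhost:7860
NUM_RE = re.compile(r"\d+")             # any run of digits

def url_pass(text: str) -> str:
    return URL_RE.sub(lambda m: wrap(m.group(), "URL-SPOKEN"), text)

def number_pass(text: str) -> str:
    return NUM_RE.sub(lambda m: wrap(m.group(), "NUMBER-SPOKEN"), text)

# The URL pass leaves "7860" visible inside the brackets,
# so the number pass matches it again and wraps it a second time.
nested = number_pass(url_pass("Running on localhost:7860"))
print(nested)
```

Because each pass only sees a string, it cannot tell that the digits it found already belong to a wrapped span.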
Thanks for the review. I'll check if I can think of better solutions to handle these cases.
Just found out that the original Kokoro itself can already handle some basic normalizations. Try it here: https://hexgrad-kokoro-tts.hf.space
Maybe we can simply disable normalization in Kokoro-FastAPI.
Disabling normalization in Kokoro-FastAPI has always been an option. The README has a section on how to do it.
I would suggest hijacking the current system for preserving custom phonemes.
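One way this suggestion could look, as a rough sketch only: after each pass, already-wrapped spans are swapped for opaque placeholders so later passes cannot match inside them, and the spans are restored at the end. The helper names, the private-use placeholder characters, and the two toy passes are assumptions for illustration; this is not how text_processor.py is actually implemented.

```python
import re

WRAPPED_RE = re.compile(r"\[[^\[\]]*\]\(/[^/]*/\)")  # matches [original](/spoken/)
PUA_START = 0xE000  # private-use code points act as opaque placeholders

def mask(text: str, store: list) -> str:
    """Replace each wrapped span with a single placeholder character."""
    def stash(m):
        store.append(m.group())
        return chr(PUA_START + len(store) - 1)
    return WRAPPED_RE.sub(stash, text)

def unmask(text: str, store: list) -> str:
    """Restore spans, newest first, in case a span holds an older placeholder."""
    for i in reversed(range(len(store))):
        text = text.replace(chr(PUA_START + i), store[i])
    return text

def run_passes(text: str, passes) -> str:
    store = []
    for normalize in passes:
        text = mask(normalize(text), store)  # hide what this pass produced
    return unmask(text, store)

# Hypothetical passes with placeholder pronunciations instead of real phonemes.
def url_pass(text: str) -> str:
    return re.sub(r"\b[a-z]+:\d+\b", lambda m: f"[{m.group()}](/URL-SPOKEN/)", text)

def number_pass(text: str) -> str:
    return re.sub(r"\d+", lambda m: f"[{m.group()}](/NUMBER-SPOKEN/)", text)

result = run_passes("Running on localhost:7860 with 2 workers", [url_pass, number_pass])
print(result)
```

Here the port number stays inside its URL span, while the standalone 2 is still picked up by the number pass.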
Currently, the text normalization algorithm simply replaces the original text with the normalized text. As a result, the generated timestamps no longer align with the original text.
Kokoro supports embedding phonemes in the text, and the token timestamps are based on the original text.
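To illustrate the difference, here is a small sketch assuming the [text](/phonemes/) embedding syntax this PR relies on. The phoneme strings are placeholders rather than real G2P output, and visible_text is a hypothetical helper, not part of Kokoro or Misaki:

```python
import re

EMBED_RE = re.compile(r"\[([^\[\]]*)\]\(/[^/]*/\)")

def visible_text(annotated: str) -> str:
    """Drop the embedded pronunciation and keep only the bracketed original text."""
    return EMBED_RE.sub(r"\1", annotated)

original = "It costs $100 at 9:30PM"

# Replacing the text loses the original tokens, so timestamps refer to different words.
replaced = "It costs one hundred dollars at nine thirty PM"

# Embedding keeps the original tokens and carries the pronunciation alongside them.
embedded = "It costs [$100](/PHONEMES-A/) at [9:30PM](/PHONEMES-B/)"

print(visible_text(embedded) == original)  # the original text is recoverable
print(visible_text(replaced) == original)  # the replaced text is not
```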
[Misaki](/misˈɑki/) is a G2P engine designed for [Kokoro](/kˈOkəɹO/) models.
Misaki is a G2P engine designed for Kokoro models.
Before this PR:
Note that $100 is mistakenly shown as "one hundred", and 9:30PM is shown as "nine thirtyPM".
After this PR:
Note that both $100 and 9:30PM are correct now.