Ok, I'm so, so sorry for how late this response is, but I literally just joined the site. I used to do a lot of UTAU, so I have a few tricks! For me, changing where I mentally broke the words up into phonemes was helpful. For example, I'd break something like "I'm a fish" into "ai-ma-f(i/e)shi" instead of "ai-mu-a-f(i/e)shi", carrying the ending consonant of one word into the start of the next.
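If it helps to see the "carry the consonant over" idea as a rule, here's a rough Python sketch. It's only an illustration, not anything from UTAU or Vocaloid: the function name `carry_consonants`, the list-of-syllables input, and the assumption that a bare ending consonant is written with a filler "u" (like "m" becoming "mu") are all made up for this example.

```python
VOWELS = set("aiueo")

def carry_consonants(words):
    """Merge a word-final consonant into a following vowel-initial word.

    Each word is a list of non-empty kana-style syllables. A bare final
    consonant is ASSUMED to be written with a filler "u" (e.g. "m" -> "mu").
    """
    out = []
    for word in words:
        syllables = list(word)
        # Only try to merge when this word starts with a vowel.
        if out and syllables and syllables[0][0] in VOWELS:
            prev = out[-1]
            # Previous syllable looks like consonant(s) + filler "u".
            if len(prev) >= 2 and prev.endswith("u") and prev[0] not in VOWELS:
                out[-1] = prev[:-1] + syllables[0]  # "mu" + "a" -> "ma"
                syllables = syllables[1:]
        out.extend(syllables)
    return out

# "I'm a fish": ai-mu / a / fi-shi
print("-".join(carry_consonants([["ai", "mu"], ["a"], ["fi", "shi"]])))
```

One catch: this can't tell a filler "u" from a real one, so a word that genuinely ends in a "mu" sound would get merged too. You'd still want to check the result by ear.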
It also helped to compare how voiced and voiceless consonants sound in the same spot. For example, in American English the "t" in "letter" comes out closer to a "d", so breaking it down as "re-d(e/a)" can work better than "re-t(e/a)". Another thing is that, in American English at least (and I think UK English too), we don't fully pronounce ending consonants all the time, unless they're followed by a vowel. This means that "do-u-n" might be better than "do-u-n-to" for "don't", and "he-(i)-n" might be better than "he-(i)-n-gu" for "hang". Also, different voice synths pronounce vowels slightly differently, so if you play around with those you might get surprising results!
I'm pretty sure there are also fun things you can do with the pitch to imply certain sounds (like raising the pitch to imply an N sound), but I'm not familiar with the Vocaloid engine, so I don't know how helpful this would be there.
I hope that this is still helpful to you! Good luck! :D