Alarming synthesized speech tells you when to freak out

Fujitsu technology could work with cloud data to tailor synthesized speech to a given situation

Fujitsu has developed technology that makes synthesized speech sound a little more natural and less robotic by adopting the appropriate tone for different situations.

For instance, if there's an emergency, it would select an alarming tone of voice. In a noisy environment, it would speak louder and more clearly. A tranquil environment might elicit a relaxing tone.

To create the voice, the technology uses machine learning algorithms to analyze the patterns of natural speech and extract voice characteristics relatively quickly from a small voice sample. From those characteristics, a high-quality synthetic voice can be built and imbued with the right tone for a given situation, Fujitsu said.
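Fujitsu has not published the details of its pipeline, but the first step it describes, pulling basic acoustic characteristics out of a short recording, can be pictured with a generic sketch using the open-source librosa library. The file name, feature choices, and parameters below are illustrative assumptions, not Fujitsu's method.

```python
# Generic illustration of extracting simple voice characteristics
# (pitch and timbre statistics) from a short sample; not Fujitsu's pipeline.
import numpy as np
import librosa

def extract_voice_characteristics(path: str) -> dict:
    """Return rough pitch and timbre statistics for a short voice sample."""
    y, sr = librosa.load(path, sr=None)  # keep the recording's native sample rate

    # Fundamental frequency (pitch) contour, limited to a typical vocal range.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    voiced_f0 = f0[voiced_flag]

    # MFCCs give a coarse summary of the speaker's timbre.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    return {
        "mean_pitch_hz": float(np.nanmean(voiced_f0)) if voiced_f0.size else 0.0,
        "pitch_range_hz": float(np.nanmax(voiced_f0) - np.nanmin(voiced_f0))
        if voiced_f0.size else 0.0,
        "mfcc_mean": mfcc.mean(axis=1),
    }

# characteristics = extract_voice_characteristics("sample_30s.wav")  # hypothetical file
```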

The synthesis technology could work with local or cloud data so that a voice system could deliver information with the appropriate tone and pitch. It might inform factory workers of a valve problem, for example, by adopting an urgent tone of voice.
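The article doesn't describe how such a system would map situational data to a tone of voice, but the basic idea can be sketched as a simple lookup from situation to prosody settings. The situation labels and numeric values here are invented for illustration and are not Fujitsu's design.

```python
# Hypothetical sketch of situation-aware prosody selection; the situation
# labels and numbers are illustrative assumptions, not Fujitsu's design.
from dataclasses import dataclass

@dataclass
class Prosody:
    pitch_scale: float   # multiplier on the base pitch
    rate_scale: float    # multiplier on the speaking rate
    volume_db: float     # gain relative to the default level

# Lookup table from a situation (derived from local or cloud data)
# to the tone the synthesized voice should adopt.
PROSODY_BY_SITUATION = {
    "emergency": Prosody(pitch_scale=1.3, rate_scale=1.2, volume_db=+6.0),
    "noisy":     Prosody(pitch_scale=1.0, rate_scale=0.9, volume_db=+9.0),
    "tranquil":  Prosody(pitch_scale=0.9, rate_scale=0.85, volume_db=-3.0),
}

def select_prosody(situation: str) -> Prosody:
    """Fall back to a neutral delivery when the situation is unknown."""
    return PROSODY_BY_SITUATION.get(situation, Prosody(1.0, 1.0, 0.0))

# Example: an urgent factory alert about a valve problem.
settings = select_prosody("emergency")
print(f"Speak at {settings.rate_scale}x rate, {settings.volume_db:+.1f} dB")
```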

An audio sample in Japanese presents a synthesized voice repeating a warning about a duct error with increasing urgency, rising in pitch and speed.

In a country like Japan, where everything from escalators to trucks issues automated voice warnings to users or people nearby, making synthetic voice systems aware of environmental and other data could make them more effective. But the technology could be used wherever synthetic voices are needed.

"Some examples of potential applications for this technology are voice-based work support solutions in factories and other work environments, natural disaster-related broadcast solutions, car navigation systems, text-to-speech services for online content, text-to-speech email solutions and automated messaging services," a Fujitsu spokesman said.

The know-how might also lead to applications that clone a person's natural voice, which would be useful for those who are losing their ability to speak due to illness.

Creating the database would take about three to five hours of the person reading text aloud, the spokesman said, after which the voice could be cloned. If the user later lost the ability to speak, a PC could read out typed words in the cloned voice.

Before he died in 2013, movie critic Roger Ebert, who had lost his voice to cancer, used a system developed by Scottish text-to-speech company CereProc to recreate his voice based on DVD film commentaries he had recorded.

Fujitsu plans to commercialize its voice synthesis technology in its fiscal year that ends in March 2015.

It is also considering offering the technology as a cloud service, the spokesman added.
