Speech Ready Now
Speaker Interview — Kai-Fu Lee
Since Microsoft is about to start a big
push for speech technology,
we asked Microsoft Corporate Vice President Kai-Fu
Lee why Visual Studio
developers should add speech to their tool chest, and whether speech
is ready for prime time. Mr. Lee will follow Bill Gates onstage for
the opening address to VSLive! San Francisco, March 23-27.
How big a role will speech technology have in IT applications in the
near term, and what makes you confident of that?
I see enormous potential for speech technology to add value to IT applications
in the near term. Speech and IT are a natural fit, because when you’re
having IT trouble, often Web-based self-service doesn’t work,
and speech interaction on the telephone is the only alternative. One
example is password reset. More than 30 percent of the support incidents
and time spent by the IT help desk involve users requesting that their
password be reset because it has expired or been lost or forgotten. A
speech application that enables users to simply call on the phone and,
via the automated speech system, reset their password can result in
tremendous cost and productivity savings for the IT department, as well
as a positive, satisfying experience for the end user.
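To make that concrete, here is a minimal sketch of the kind of call flow
such an application follows, in Python. Everything here is illustrative:
the recognizer is simulated with console input, and the dictionary stands
in for whatever identity store a real deployment would query.

```python
import secrets
import string

# Illustrative identity store; a real system would query the corporate directory.
EMPLOYEES = {"1234": {"name": "Pat", "security_word": "osprey"}}

def recognize(prompt: str) -> str:
    """Stand-in for the speech recognizer: play a prompt, return what the caller said."""
    return input(prompt + " ").strip().lower()

def password_reset_call() -> None:
    emp_id = recognize("Welcome to the IT help desk. Please say or key in your employee ID.")
    record = EMPLOYEES.get(emp_id)
    if record is None:
        print("Sorry, I could not find that ID. Transferring you to an agent.")
        return
    answer = recognize(f"Thanks, {record['name']}. Please say your security word.")
    if answer != record["security_word"]:
        print("That does not match our records. Transferring you to an agent.")
        return
    # Issue a one-time temporary password and read it back to the caller.
    alphabet = string.ascii_lowercase + string.digits
    temp = "".join(secrets.choice(alphabet) for _ in range(8))
    print(f"Your password has been reset. Your temporary password is {temp}.")

if __name__ == "__main__":
    password_reset_call()
```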
I’m confident of the impact of speech on IT applications because
speech technology has advanced to the point where it is robust and
accurate enough for mainstream applications. Another reason is that
speech telephony applications and Web applications are converging,
meaning that an IT shop can run one integrated, unified application for
both speech and the Web and deploy it on its existing Web infrastructure.
This fits nicely with the IT mandate to cost-effectively leverage
existing IT assets and extend them with new investments, such as speech,
to gain new functionality without ripping and replacing.
What advances have occurred to make speech ready for prime time?
First, there have been tremendous technological advances in the areas
of speech recognition and synthesis, statistical modeling, and noise
robustness. Every year, the speech-recognition error rate is reduced
by 10-15 percent. At this rate, for close-talking dictation, machines
will reach human performance in about seven to eight years.
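To unpack the arithmetic behind that projection: a fixed relative
reduction compounds year over year, so the time to parity depends only on
the current machine-to-human error ratio. A quick sketch (the roughly 3x
starting gap is an illustrative assumption, not a figure from the
interview):

```python
import math

# Illustrative assumption: machine error is about 3x the human error rate today.
gap = 3.0

# At a fixed relative reduction r per year, the gap shrinks as gap * (1 - r)^n.
# Solving gap * (1 - r)^n = 1 for n gives the years until parity.
for r in (0.10, 0.15):
    years = math.log(1 / gap) / math.log(1 - r)
    print(f"{r:.0%} reduction per year: parity in about {years:.1f} years")
```

At the faster rate, parity arrives in roughly seven years, which lines up
with the seven-to-eight-year estimate.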
Another significant advance is the emergence of open platforms and
standards. Customers have demanded that speech technologies and the
related interactive voice response (IVR) platforms move away from proprietary,
incompatible standards and adopt open W3C specifications and standards.
One example of an open standard driving speech to prime time is SALT
(Speech Application Language Tags). Open standards will help drive down
costs, improve interoperability, and encourage reusable software. This
will help bring speech to the mainstream. In addition, the release of
Microsoft Speech Server 2004 will provide a common, standard platform
for application developers to coalesce around, as opposed to the
proprietary, fragmented and somewhat confusing platform market that
exists today. Microsoft Speech Server will enable the creation of
packaged speech applications that the mainstream development community
can build and deliver using Visual Studio .NET.
I previously mentioned the convergence of telephony and Web technologies,
which is another driver that is advancing the market and making speech
ready for prime time. Finally, the cost of speech solutions is dropping
to the point where not only large enterprises but also medium-size
businesses will find speech a cost-effective technology to implement.
Microsoft Speech Server is leading the way in enabling flexible,
integrated speech solutions at the lowest total cost of ownership.
How big a leap is it for a skilled Visual Studio developer to add speech
to his or her résumé?
It’s not a big leap at all, but as with any new technology there
is a learning curve. However, we’ve provided the tools that make
speech development for Visual Studio developers faster and easier
than ever before. Our Microsoft Speech Application Software Development
Kit
(SDK) integrates into Visual Studio .NET and, with the controls and
tools provided, enables programmers to add speech to their Web
applications using the standard programming paradigm they use for any
other Web application. So if developers know Web programming, use
objects and events, and can write a little script, they can easily use
our tools for speech-enabled Web application development.
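As a rough illustration of that objects-and-events paradigm, here is a
minimal sketch in Python. The SpeechControl class, its members, and the
console-input recognizer are hypothetical stand-ins, not the Speech
Application SDK's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SpeechControl:
    """Hypothetical stand-in for a speech-enabled page control: it plays a
    prompt, recognizes against a small grammar, and raises an event with
    the result, mirroring the familiar object-and-event paradigm."""
    prompt: str
    grammar: frozenset[str]                       # accepted phrases
    on_recognized: Callable[[str], None] = print  # event handler

    def activate(self) -> None:
        heard = input(self.prompt + " ").strip().lower()  # simulated recognizer
        if heard in self.grammar:
            self.on_recognized(heard)  # fire the event to bind the result
        else:
            print("Sorry, I didn't catch that.")

# Usage: bind the recognized city into a form field, much as a Web
# developer would bind a recognition result to a text box.
form: dict[str, str] = {}
city = SpeechControl(
    prompt="Which city are you flying to?",
    grammar=frozenset({"boston", "new york", "seattle"}),
    on_recognized=lambda text: form.update(city=text),
)
city.activate()
print(form)
```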
Speech does, however, add another layer of user interface design,
sometimes called VUI (voice user interface). So just as developers
learned GUI (graphical user interface) best practices over time, they
will need to learn VUI best practices as well. We’ve made that job
easier for Visual Studio .NET developers through the pre-built ASP.NET
controls included in the SDK, which encapsulate VUI design within the
control, and through instructor-led training courses that provide the
necessary VUI design knowledge for the developer.
A company called Vertigo Software is a great example of a Visual Studio
.NET development shop that didn’t know anything about speech, yet within
a few weeks was able to use our tools to build and deploy some excellent
reference speech applications. In the January/February issue of Speech
Technology Magazine, Vertigo details its experience using the Microsoft
Speech Application SDK to build speech-enabled Web applications.
Where do you expect speech to have the biggest impact — the data center,
mobile devices, desktop clients — and why?
Speech technologies are currently gaining tremendous adoption in the
data center, primarily for customer self-service applications in telephony.
The use of speech in the data center is driven primarily by telephones
and cell phones. The data center approach allows a scalable, manageable
and reliable server-based deployment for speech applications capable
of supporting hundreds or thousands of simultaneous speech-enabled phone
calls. So in the short term, the data center will be the arena in which
speech has the biggest impact. In the medium to long term, mobile screen-based
devices such as Pocket PCs and Smartphones will reap the value of speech
technology. We call this multimodal: the mixture of speech and visual
input/output.
Imagine calling your financial company and,
speaking into your speech-enabled mobile device, asking for an
update on your stock portfolio, or calling your travel company via its
automated
system for a listing of upcoming airline flights from Boston
to New York. Your speech request will be answered with a GUI-based display
of your stocks or the listing of flights. We believe mobile devices,
given their rapid proliferation and the natural limitations they have
for easy input, will also benefit from speech technology.
The desktop currently provides much easier input modalities, such as the
keyboard and mouse, so the use of speech on the desktop is somewhat more
limited. However, we see value in speech on the desktop via dictation
capabilities and command and control of applications, particularly for
people with disabilities and people who are slow typists. But as I
mentioned earlier, with 10-15 percent improvement every year, speech on
the desktop will be faster than typing in just a few years.
Finally, when human-level accuracy is possible, speech will be
pervasive, and it will lead to a new kind of user interface: a
delegation interface, where you no longer tell the computer the steps to
do something but simply the goal you want to accomplish, and the
computer figures out the rest! That may be 10 years away, but it will
completely revolutionize the human/machine interface and change the way
we interact with every device.