Conversational Kiosks

Imagine an installation, such as in a museum or public space. The installation includes some sort of kiosk or physical representative of one or more artificial entities. Each kiosk gives an embodiment -- a place for existence -- for verbal input and output to each entity. The kiosk contains a loudspeaker and a microphone (voice and ears).

A side effect of the kiosk design is protection of the microphone: against wind noise and "mic-eating" behaviors, against "mic-grabbing" behaviors, against "is this thing on?" behaviors (tapping, blowing), and against theft.

The microphone is connected to a multi-slot-filling ASR. An intermediate software application handles the mapping between input and output using a simple table: the speech grammars return semantic tags that key into the table, and the table holds strings of text to be spoken by the artificial voice. The job of the software is to map ASR (speech input) onto TTS (speech output); the result is a "conversation." A visual display on the kiosk provides hints to the user on what to say. For example, the user might see:

- ask about the voice
- when were you born?
- say your favorite colour
- do you have a secret?
- if you can talk does that mean you're smart?
- am I real or am I a simulation?

A rich grammar of permutations recognizes a large number of these sentences in their various grammatical forms. The table then supplies a reply, spoken by the TTS (artificial speech) as parameterised by algorithmic and/or stochastic methods (e.g. voice quality, speech rate, degree of distortion, prosodic characteristics). The machine speaks:

- that was a good day
- [male voice] this is a male voice [female voice] and this is a female voice. [both in unison] Which do you prefer?
- tell your secret first
- does that make sense?
- hmmm ... red, blue, yellow, and green ... they all keep coming up.
- if you were a simulation, how would you know?

Note that these are not a one-to-one mapping. Saying the same sentence produces different results each time. BUT (and this is the key), a motivated user who wants to pursue a conversation will be led easily down the primrose path. So this chatbot seems to hold philosophical discussions as long as the user sticks with the thread (read "The Most Human Human" by Brian Christian).
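A minimal sketch of what that table-driven mapping could look like, assuming a recognizer that returns a single semantic tag per utterance. The tag names, the table contents, and the reply_for helper are all illustrative stand-ins, not a real grammar or TTS interface:

```python
import random

# Hypothetical response table: each semantic tag returned by the speech
# grammar maps to several candidate replies, so the same question does
# not always produce the same answer.
RESPONSE_TABLE = {
    "ASK_BIRTH":   ["that was a good day",
                    "does that make sense?"],
    "ASK_COLOUR":  ["hmmm ... red, blue, yellow, and green ... they all keep coming up."],
    "ASK_SECRET":  ["tell your secret first"],
    "ASK_REALITY": ["if you were a simulation, how would you know?"],
    "NO_MATCH":    ["does that make sense?"],   # fallback for off-table input
}

def reply_for(semantic_tag: str) -> str:
    """Map an ASR semantic tag onto a TTS string, stochastically."""
    candidates = RESPONSE_TABLE.get(semantic_tag, RESPONSE_TABLE["NO_MATCH"])
    return random.choice(candidates)

# The same tag can yield different spoken output on successive turns.
print(reply_for("ASK_BIRTH"))
print(reply_for("ASK_BIRTH"))
```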

Many users will not stick with a thread. They will experiment with silly, nasty, or humorous comments. Those, too, will generate spoken responses, some of them randomly appropriate, others irrelevant or non sequiturs. All input triggers some kind of spoken output (excepting perhaps background noises or similar probably-should-be-rejected inputs).

If users follow a predicted thread, then the voice becomes increasingly human-like. If, input by input, the speech shows no pattern as represented in the table, then TTS parameters can be freely changed -- never during a sentence, but from sentence to sentence. That includes timbral extremes, electronic processing, echo, and extreme rates -- determined randomly as people experiment. If a user returns to sentient interaction, the speech narrows again in its TTS characteristics.
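One way this widening and narrowing could be sketched, assuming a per-sentence "on-thread" score derived from how well recent input matches the table. The parameter names and ranges below are hypothetical, not those of any particular TTS engine:

```python
import random

# Illustrative TTS parameter ranges: when the user is on-thread the ranges
# collapse toward neutral, human-like values; when input is patternless the
# ranges open up toward the extremes.
NEUTRAL = {"rate": (1.0, 1.0), "pitch_shift": (0.0, 0.0), "distortion": (0.0, 0.0)}
EXTREME = {"rate": (0.5, 2.5), "pitch_shift": (-12.0, 12.0), "distortion": (0.0, 0.8)}

def tts_params(on_thread_score: float) -> dict:
    """Pick per-sentence TTS parameters.

    on_thread_score is 1.0 when the user follows a predicted thread and
    0.0 when input shows no pattern in the table. Parameters are drawn
    once per sentence -- never changed mid-sentence.
    """
    params = {}
    for name in NEUTRAL:
        # Interpolate each range between neutral and extreme, then sample.
        lo = NEUTRAL[name][0] + (1 - on_thread_score) * (EXTREME[name][0] - NEUTRAL[name][0])
        hi = NEUTRAL[name][1] + (1 - on_thread_score) * (EXTREME[name][1] - NEUTRAL[name][1])
        params[name] = random.uniform(lo, hi)
    return params

# On-thread speech stays near neutral; experimentation yields wilder settings.
print(tts_params(1.0))
print(tts_params(0.2))
```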

There are several such kiosks. The installation is available to the public 24 x 7. At certain times, there is a formal performance.

For the performance, spectators can still experiment at the kiosks, but actors are able to "play" the kiosks through various conversations. One actor's conversation with a kiosk dovetails with those of neighboring kiosks; sometimes the input sentence is the same as a neighboring kiosk's output. The actors are able to pull the audience in and modulate the dialogue willfully. This improvisation gives the audience a sense of having power and control over language. But the concept of "sentient" becomes blurred and relative: smart accidents occur; dumb organized sequences occur.
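A rough sketch of that dovetailing, assuming the kiosks share a simple in-process broadcast. The Kiosk class and its overhear hook are hypothetical stand-ins for whatever networking the installation would actually use:

```python
# Each kiosk publishes what it just spoke, and a neighbor may treat that
# utterance as if it were recognized input on its own microphone.
class Kiosk:
    def __init__(self, name, neighbors=None):
        self.name = name
        self.neighbors = neighbors or []

    def speak(self, sentence: str) -> None:
        print(f"[{self.name} speaks] {sentence}")
        for other in self.neighbors:
            other.overhear(sentence)

    def overhear(self, sentence: str) -> None:
        # A neighboring kiosk's output arrives here as candidate input;
        # in the installation this would feed the same table lookup
        # that handles a human speaker.
        print(f"[{self.name} hears]  {sentence}")

a, b = Kiosk("kiosk-A"), Kiosk("kiosk-B")
a.neighbors.append(b)
a.speak("tell your secret first")   # kiosk-B now holds this as input
```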

Just some half-baked ideas -- 19 May, 2011, Bruce Balentine

