Verbal and nonverbal discourse planning

Berardina De Carolis (University of Bari, Department of Informatics, Intelligent Interfaces), Catherine Pelachaud (University of Rome "La Sapienza", Department of Computer and System Science), Isabella Poggi (University of Rome Three, Department of Linguistics)

1 Introduction

We are designing a Conversational Agent that communicates simultaneously through speech and an expressive face. In face-to-face interaction, or in any communicative act where the visual modality is available along with the acoustic one, multimodal signals are at work: our communicative acts are performed not only through words, but also through intonation, body posture, hand gestures, gaze patterns, facial expressions and so on. An important step is therefore to define how the Agent's communicative acts are expressed in a coordinated, either sequential or simultaneous, verbal and nonverbal message. In making an Autonomous Agent capable of communicative and expressive behavior, then, a relevant problem is how the Agent plans not only what to communicate, but also by what (verbal or nonverbal) signals, in what combination, and with what synchronization. This depends on several factors: (i) the available modalities (e.g., face, gaze, voice); (ii) the cognitive ease of production and processing of signals (for example, in describing an object a gesture may be more expressive than a word); (iii) the expressivity of each signal in communicating specific meanings (for example, emotions are better told by facial expression than by words); (iv) the appropriateness of signals to social situations (for example, an insulting word may be more easily sanctioned than a scornful gaze). Finally, metacommunicative constraints may apply that, say, lead to redundancy when the information to convey is particularly important or needs to be particularly clear: this leads to using verbal and nonverbal signals at the same time.
In recent years, several multimodal conversational systems [2, 3, 5, 18, 1] have been proposed. These systems integrate verbal and nonverbal signals such as deictic gestures, gaze behavior and communicative facial expressions. Cassell and Stone [4] have designed a multimodal manager whose role is to supervise the distribution of behaviors across several channels (verbal, head, hand, face, body and gaze). Our system is close to this last one, but we also consider the context in which the conversation takes place. In this paper we first describe our enriched discourse generator, explaining the two sets of rules (trigger and regulation rules) we have added. We then review the different types of gaze communicative acts. Finally, we present the variables defining the context and how they modify the computation of the display of the communicative acts.

2 Discourse Generator

To achieve our aim, we developed a generator whose architecture is shown in Figure 1. This architecture is inspired by classical NLG systems [11, 13], with some differences due to the need to consider two factors: keeping the content level independent of the way in which it is output (media independence), with distributed resources; and including emotional and personality factors in the discourse plan. Starting from a given communicative goal, a hierarchical planner [7, 6] builds a discourse plan-tree, using a pool of plan operators and according to constraints on the domain and on the mental states of the Sender and Addressee. The resulting plan is a hierarchy of goals and subgoals down to the leaves, which represent the primitive communicative acts (Inform, Request, etc.). Each node of this tree carries information about the communicative goal, the role the node plays in the Rhetorical Relation (RR) [13] attached to its father node, the RR that relates the two subtrees departing from it, and the focus of the discourse.
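The discourse plan-tree described above can be sketched as a simple recursive structure; the field and class names below are ours, chosen for illustration, not taken from the authors' implementation:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PlanNode:
    """One node of the discourse plan-tree (illustrative names)."""
    goal: str                                   # communicative goal, e.g. "Describe(S, A, event)"
    rhetorical_relation: Optional[str] = None   # RR relating the subtrees departing from this node
    role_in_parent_rr: Optional[str] = None     # role played in the father node's RR
    focus: Optional[str] = None                 # current discourse focus
    children: List["PlanNode"] = field(default_factory=list)

    def leaves(self) -> List["PlanNode"]:
        """The leaves are the primitive communicative acts (Inform, Request, ...)."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

# A fragment of the "stay in jail" example discussed later in Section 2.4:
root = PlanNode(
    goal="Describe(S, A, Stay(jail))",
    rhetorical_relation="ElabObjAttr",
    focus="Stay(jail)",
    children=[
        PlanNode(goal="Inform(S, A, Stay(jail))", role_in_parent_rr="nucleus"),
        PlanNode(goal="Inform(S, A, LengthOfStay(jail))", role_in_parent_rr="satellite"),
    ],
)
print([n.goal for n in root.leaves()])
# -> ['Inform(S, A, Stay(jail))', 'Inform(S, A, LengthOfStay(jail))']
```

At this stage the tree is deliberately media independent: nothing in the node refers to voice, face or gaze, so the same plan can later be realized as text, speech or animated behavior.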
Being media independent, this plan is suitable for translation into written text or speech. However, at this level
only the referential content of the presentation is represented: information on the emotional and cognitive state that should be expressed in communicating with the User (the Addressee of the conversation) is missing.

Figure 1: System architecture (the Planner, driven by the communicative goal, applies plan operators to produce the D-Plan; the Goal-Media Prioritizer applies trigger and regulation rules, using the domain KB and the context model, to produce the enriched plan; the surface realizer and body instantiator then produce the media-instantiated discourse for a face, a 2D character, written text, ...)

2.1 Nonverbal communicative acts

Emotional and cognitive information very often passes through nonverbal communication. In previous works [15, 12, 17], we have drawn a semantic typology of the types of information that we can in principle convey while communicating. Focusing only on gaze expressions (though all of these types of information can also be conveyed through verbal behavior or, in many cases, by gestures or facial expressions), we have shown that gaze can make reference to concrete or abstract entities present in the physical context or mentioned in discourse (that is, it has a deictic function); it can mention concrete or metaphorical properties (like "huge", "small" or "subtle"); it can express emotions (fear, anger, dismay...), performatives (it can implore, defy...), conversational information (sentence focus, turn-taking requests) and metacognitive information ("I am thinking", "I am trying to remember"); and, finally, it can express Rhetorical Relations within a discourse (contrast, elaboration...) [13]. What we want to stress here is that RRs are only a (small) subset of the information we convey while talking. The generation of texts must then be enriched with all these other types of information.
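The gaze typology above can be captured as a small enumeration; the categories come from the text, while the lookup table mapping meanings to gaze functions is only a toy illustration of ours:

```python
from enum import Enum

class GazeFunction(Enum):
    """Gaze functions from the semantic typology of [15, 12, 17]."""
    DEICTIC = "reference to entities in the physical context or in discourse"
    ADJECTIVAL = "concrete or metaphorical properties (huge, small, subtle)"
    EMOTIONAL = "emotions (fear, anger, dismay, ...)"
    PERFORMATIVE = "performatives (implore, defy, ...)"
    CONVERSATIONAL = "sentence focus, turn-taking requests"
    METACOGNITIVE = "'I am thinking', 'I am trying to remember'"
    RHETORICAL = "rhetorical relations (contrast, elaboration, ...)"

# Toy lookup: which gaze function could express a given meaning type
MEANING_TO_GAZE = {
    "point_at_object": GazeFunction.DEICTIC,
    "property:huge": GazeFunction.ADJECTIVAL,
    "emotion:fear": GazeFunction.EMOTIONAL,
    "remembering": GazeFunction.METACOGNITIVE,
    "rr:contrast": GazeFunction.RHETORICAL,
}
print(MEANING_TO_GAZE["emotion:fear"].name)  # -> EMOTIONAL
```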
2.2 Goal-media prioritizer

To associate verbal information with a combination of nonverbal expressions (for instance a gaze, an eyebrow expression, a gesture, etc.), we need to select the expressions to be shown and to decide, based on their priority, if and when they should be displayed. This is the task of the goal-media prioritizer, a notion first introduced by [4]: this module revises the plan by enriching its nodes with information about the type of nonverbal signals to employ at each stage of the conversation, their combination and their synchronization with verbal communication. The goal-media prioritizer enriches the original plan-tree by applying two sets of rules: trigger rules and regulation rules. The first set is employed to fire a particular signal, based on the information to display, on the domain and on the context. The second one is employed to decide whether or not to display a signal
and with which intensity [8]. The left parts of trigger rules formalise domain- and context-dependent conditions; their right parts establish the emotional, metacognitive or other kinds of signals to convey. The triggering mechanism has been built on the basis of the emotion category theory of Ortony, Clore and Collins [14], as revised by Elliott [9, 10]. The general structure of this type of rule is the following: IF DC-Cond THEN (Signal Ag e), where DC-Cond is a condition on the context and/or on the domain, Signal represents the type of triggered signal (i.e. felt emotion, metacognitive information, deictic reference and so on), Ag is the agent conveying the signal and e is the object of the Signal. For instance, given a node in the discourse plan that corresponds to the communicative goal of describing an event (Describe(S, A, event)), whose focus is event, two different trigger rules may be applied to deal with pleasant or unpleasant events:

IF (Focus(node) is unpleasant) THEN (Feel S Sadness)
IF (Focus(node) is pleasant) THEN (Feel S Joy)

2.3 Context

Obviously, the expression of a particular emotional state may depend on other factors that influence the way in which that emotion is signalled, its expression intensity (degree), and so on. As mentioned before, this process is governed by the regulation rules, which take into account factors like the Sender's and the Addressee's personality traits, their social relationship and so on. Contextual information relevant to this decision therefore includes, in our view:

1. the Sender's display goal - what S (Sender) wants A (Addressee) to do (to console, to help, to advise...) while expressing one's emotion;
2. the social situation - either public or intimate;
3. the Addressee's model - what the Sender thinks of the Addressee, which includes in its turn:
   (a) A's cognitive capacity:
       i. understanding capacity (S will not share emotions that A cannot understand);
       ii. problem-solving capacity (S will not show sadness and ask for help if A cannot give good advice);
       iii. experience (if A never felt similar emotions, he can't help S);
   (b) A's practical resources (if the reason I want to show I am in love is to ask my friend for his garçonnière, I will not display it if he has no garçonnière);
   (c) A's personality, where we include [16]:
       i. A's typical goals (I will not show sadness to a selfish person, who gives more importance to his own goals than to other people's);
       ii. A's typical emotions (I will not show happiness to an envious friend; I will not show fear to someone who is apprehensive);
   (d) the social relationship between S and A, namely:
       i. role and power relationship (I will not show contempt to one who can retaliate; I won't show anger at my boss);
       ii. social attitude - whether helping or aggressive towards each other (I'll not show anger at my child if she cut a precious suit to make me a present).

An example of how the context determines whether or not fear is displayed is the following:

IF (Feel S Fear) AND (Bel S (Apprehensive A)) THEN (Goal S Not(Display S Fear))

S will not show fear if she does not want an apprehensive person A to get worried.

IF (Feel S Fear) AND (Goal S (Help A S)) THEN (Goal S (Display S Fear))

S will display her fear if she wants to get help from A.
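The two rule sets can be sketched as functions over a plan node, the domain and the context. The rule contents follow the examples in the text; the data representation (dictionaries, tuples) is our assumption, not the authors' actual formalism:

```python
# Trigger rules: IF DC-Cond THEN (Signal Ag e)
def trigger_rules(node, domain):
    """Fire signals from domain/context conditions (pleasant/unpleasant example)."""
    signals = []
    valence = domain.get(node["focus"])      # e.g. "pleasant" or "unpleasant"
    if valence == "unpleasant":
        signals.append(("Feel", "S", "Sadness"))
    elif valence == "pleasant":
        signals.append(("Feel", "S", "Joy"))
    return signals

# Regulation rules: decide whether a felt emotion is actually displayed
def regulate(signal, context):
    """Return a display goal, or None to suppress the signal."""
    _kind, agent, emotion = signal
    if emotion == "Fear" and context.get("addressee_apprehensive"):
        return None                           # Goal S Not(Display S Fear)
    return ("Display", agent, emotion)        # Goal S (Display S Fear)

node = {"focus": "Stay(jail)"}
domain = {"Stay(jail)": "unpleasant"}
felt = trigger_rules(node, domain)
print(felt)                                   # -> [('Feel', 'S', 'Sadness')]
print(regulate(("Feel", "S", "Fear"), {"addressee_apprehensive": True}))  # -> None
```

The point of splitting the two stages is visible even in this toy version: the trigger rules consult only the plan node and the domain, while the regulation rules consult only the context, so context models can be changed without touching the emotion-triggering logic.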
Figure 2: Example of enriched plan (a D-Plan portion in which (Describe S A Stay(jail)) expands, via ElabObjAttr, into (Inform S A Stay(jail)) and (Inform S A LengthOfStay(jail)); in the enriched plan, the Inform acts are accompanied by (Display S A Distress), (Display S A "I'm thinking") and (Display S A Adjectival))

2.4 Enriched plan

At the end of this process, a new enriched plan is generated: this is the input of the surface realizer which, according to the type of body that has been selected for the Animated Agent, defines a media-instantiated presentation. Figure 2 shows an example of the generation process, illustrating how the result of the planner is enriched and how this new plan is transformed into an annotated structure corresponding to the media-instantiated presentation. In this example, the communicative goal of the Sender is to describe to A her stay in jail. In the domain knowledge base, jail is associated with distress. To describe her stay in jail, the Sender has to inform A that she stayed in jail and how long she stayed there; but at the same time she remembers how distressful it was to stay in jail and how long it was. To maintain the independence and the possible distribution of all these components, the enriched plan is annotated with an XML-based language. In particular, the media-instantiated presentation for the same example will be represented as:

<gaze type="ImThinking"> I have been in <gaze type="Distress"> jail </gaze> for a very <gaze type="LargeAdjectival"> long </gaze> period </gaze>

The gaze types refer to the gaze semantic typology proposed in [17].

3 Conclusion

In this paper we have proposed a new approach to generating verbal and nonverbal behavior for an embodied agent. As we generate the discourse, we consider the context. The context gathers information on the situation in which the conversation takes place, the Addressee's model, and the social relationship between the Sender and the Addressee.
We claim that the consideration of the context allows us to build a more personalized and individual agent, with a less robotic and repetitive behavior.

References

[1] E. André, T. Rist, and J. Mueller. Integrating reactive and scripted behaviors in a life-like presentation agent. In Proceedings of the Second International Conference on Autonomous Agents, pages 261-268, 1998.
[2] J. Cassell, T. Bickmore, M. Billinghurst, L. Campbell, K. Chang, H. Vilhjálmsson, and H. Yan. Embodiment in conversational interfaces: Rea. In CHI '99, pages 520-527, Pittsburgh, PA, 1999.
[3] J. Cassell, C. Pelachaud, N.I. Badler, M. Steedman, B. Achorn, T. Becket, B. Douville, S. Prevost, and M. Stone. Animated conversation: Rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents. In Computer Graphics Proceedings, Annual Conference Series, pages 413-420. ACM SIGGRAPH, 1994.
[4] J. Cassell and M. Stone. Living hand to mouth: Psychological theories about speech and gestures in interactive dialogue systems. In AAAI '99 Fall Symposium on Psychological Models of Communication in Collaborative Systems, 1999.
[5] E. Churchill, S. Prevost, T. Bickmore, P. Hodgson, T. Sullivan, and L. Cook. Design issues for situated conversational characters. In WECC '98, The First Workshop on Embodied Conversational Characters, October 1998.
[6] B. De Carolis, F. de Rosis, D.C. Berry, and I. Michas. Evaluating plan-based hypermedia generation. In Proceedings of the 7th European Workshop on Natural Language Generation, Toulouse, France, 1999.
[7] B. De Carolis, F. de Rosis, F. Grasso, A. Rossiello, D.C. Berry, and T. Gillie. Generating recipient-centered explanations about drug prescription. Artificial Intelligence in Medicine, 8:123-145, 1996.
[8] P. Ekman. Emotion in the Human Face. Cambridge University Press, 1982.
[9] C. Elliott. The Affective Reasoner: A process model of emotions in a multi-agent system. PhD thesis, Northwestern University, The Institute for the Learning Sciences, 1992. Technical Report No. 32.
[10] C. Elliott and G. Siegle. Variables influencing the intensity of simulated affective states. In AAAI Technical Report for the Spring Symposium on Reasoning about Mental States: Formal Theories and Applications, pages 58-67, Stanford University, March 23-25, 1993. AAAI.
[11] B.J. Grosz and C.L. Sidner. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3), 1986.
[12] E. Magno-Caldognetto and I. Poggi. Micro- and macro-bimodality. In C. Benoit and R. Campbell, editors, Proceedings of the ESCA AVSP '97 Workshop, Rhodes, Greece, September 1997.
[13] W.C. Mann, C.M.I.M. Matthiessen, and S. Thompson. Rhetorical structure theory and text analysis. Technical Report 89-242, ISI Research, 1989.
[14] A. Ortony, G.L. Clore, and A. Collins. The Cognitive Structure of Emotions. Cambridge University Press, 1988.
[15] I. Poggi. Mind markers. In 5th International Pragmatics Conference, Mexico City, July 5-9, 1996.
[16] I. Poggi and C. Pelachaud. Facial performatives in a conversational system. In J. Cassell, J. Sullivan, S. Prevost, and E.
Churchill, editors, Embodied Conversational Agents. MIT Press, Cambridge, MA, 2000.
[17] I. Poggi, C. Pelachaud, and F. de Rosis. Eye communication in a conversational 3D synthetic agent. AI Communications, Special Issue on Behavior Planning for Life-Like Characters and Avatars, 2000.
[18] J. Rickel and W.L. Johnson. Animated agents for procedural training in virtual reality: Perception, cognition, and motor control. Applied Artificial Intelligence, 13:343-382, 1999.