Multimodal human-computer interaction refers to the "interaction with the virtual and physical environment through natural modes of communication". Multimodal interaction thus enables freer and more natural communication, interfacing users with automated systems in both input and output. Specifically, multimodal systems can offer a flexible, efficient and usable environment in which users interact through input modalities, such as speech, handwriting, hand gesture and gaze, and receive information from the system through output modalities, such as speech synthesis, smart graphics and other modalities, suitably combined.

A multimodal system must therefore recognize the inputs from the different modalities and combine them according to temporal and contextual constraints so that they can be interpreted. This process is known as multimodal fusion, and it has been the subject of research from the 1990s to the present. The fused inputs are then interpreted by the system. The naturalness and flexibility of this interaction can yield more than one interpretation for each modality (channel) and for their simultaneous use, producing multimodal ambiguity, generally due to imprecision, noise or similar factors. Several methods have been proposed for resolving such ambiguities. Finally, the system returns outputs to the user, disaggregated across the various modal channels and arranged into consistent feedback (fission).

The pervasive use of mobile devices, sensors and web technologies can offer adequate computational resources to manage the complexity implied by multimodal interaction. "Using cloud for involving shared computational resources in managing the complexity of multimodal interaction represents an opportunity. In fact, cloud computing allows delivering shared scalable, configurable computing resources that can be dynamically and automatically provisioned and released".
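
As an illustration of the temporal side of multimodal fusion described above, the following is a minimal sketch in Python. It is not a method taken from the literature cited here: the `ModalInput` type, the `fuse_by_time` function and the `max_gap` threshold are hypothetical names and values chosen for the example, and contextual constraints, ambiguity resolution and fission are deliberately left out. The sketch only groups recognized inputs whose time spans overlap or lie close together, so that a later stage could interpret each group jointly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalInput:
    """One recognized input event (hypothetical type for this sketch)."""
    modality: str  # e.g. "speech", "gesture", "gaze"
    value: str     # recognized content for that modality
    start: float   # start time in seconds
    end: float     # end time in seconds


def fuse_by_time(inputs, max_gap=0.5):
    """Group inputs that overlap or start within max_gap seconds of the
    current group's latest end time. max_gap is an assumed threshold."""
    groups = []
    group_end = None
    for item in sorted(inputs, key=lambda i: i.start):
        if groups and item.start - group_end <= max_gap:
            # Close enough in time: treat as part of the same utterance.
            groups[-1].append(item)
            group_end = max(group_end, item.end)
        else:
            # Too far from the previous group: start a new one.
            groups.append([item])
            group_end = item.end
    return groups


# Illustrative data only: a spoken command accompanied by two pointing
# gestures, followed several seconds later by an unrelated command.
events = [
    ModalInput("speech", "put that there", 0.0, 1.2),
    ModalInput("gesture", "point:object_7", 0.4, 0.6),
    ModalInput("gesture", "point:location_3", 1.0, 1.1),
    ModalInput("speech", "undo", 5.0, 5.4),
]
fused = fuse_by_time(events)
```

Under these assumptions, the first three events fall into one group and the later spoken command forms its own. A real system would add contextual constraints on top of this temporal grouping, for example checking that a pointing gesture can actually fill a slot in the spoken command.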