Multimodal systems are quickly becoming the new norm in software development. Whether it’s chatbots that combine voice and text, home automation systems that react to sound and movement, or smart assistants processing images, gestures, and speech simultaneously, modern software systems rarely rely on a single input modality or channel anymore.

This opens fascinating opportunities for richer applications, but it also raises a major challenge for requirements engineering: how do we precisely specify requirements for systems that depend on multiple input modalities at once?

Traditional requirements languages and tools fall short here: they offer no specific constructs or templates to properly capture the particularities of this type of requirement, leaving developers at risk of implementing AI-powered features that don’t truly meet user needs.

In this paper, we propose MERLAN (Multimodal Environment Requirements Language), a Domain-Specific Language (DSL) designed to formalize multimodal requirements in a precise, technology- and platform-independent way.

In this post, we’ll walk through the key ideas of the paper, from the motivation and running example to the DSL’s design, syntax, and tool support, plus the future research roadmap we are currently considering.

Why Do We Need a DSL for Multimodal Requirements?

With the explosion of Machine Learning and other AI techniques, software systems are quickly adopting new types of complex user interfaces that require processing new input modalities such as text, audio, and images, sometimes more than one at the same time. Interfaces of this kind are called Multimodal User Interfaces (MUIs).

While building this type of AI-enhanced system is becoming easier thanks to the constant influx of, for instance, new multimodal Large Language Models (LLMs) that facilitate the analysis of multimodal inputs, validating that the system satisfies the actual needs of the user is becoming more and more complex.

Indeed, we are missing proper requirements engineering languages and techniques to facilitate a precise specification of:

  1. the MUI conditions that should trigger a system response;
  2. the data (“entities” in MUI terminology) from the multimodal input that should be collected to provide an adequate response; and
  3. the actual response itself.

Preliminary work in requirements for chatbots focused on intents (user goals) and entities (extracted parameters). But those efforts are limited to textual inputs. For multimodal environments, where audio, video, and other signals are in play simultaneously, there is no established language to define conditions precisely.

This is the gap that MERLAN aims to fill.

A Motivating Example: The Smart House Agent

To illustrate our approach, let’s describe a potential house automation and security system—a “house agent.”

The house agent has input devices for text, sound, video, temperature, light, and movement, plus output channels like text, audio, and predefined actions (calling the police, turning on lights, sounding an alarm).

The challenge is to specify the requirements that define when and how the system should react. For example:

  1. If smoke is detected → notify the owner.
  2. If fire is detected OR (the house is empty AND a car or person is detected) → trigger the alarm, notify the owner, and call the police.
  3. If a strong sound is detected during the night → turn on the lights.
  4. If a car with an unrecognized license plate is detected → notify the owner.

These rules involve both concrete entities (smoke, fire, person, car) and abstract entities (night, empty house). They also combine different modalities—audio, image, movement.

While you could write them down informally, that wouldn’t be enough to automatically generate a working system. MERLAN provides a precise DSL to express such requirements.

The Design of MERLAN

Like any DSL, MERLAN is defined by:

  • An abstract syntax (metamodel) that captures the core concepts.
  • A concrete syntax (notation) that lets users write requirements.

Metamodel elements

The following figure illustrates the main elements of the DSL metamodel. We next give a short description of some of them, but refer to the full paper for more details.

[Figure: the MERLAN metamodel]

  1. Multimodal Requirements
    • Simple requirements: rules that involve a single entity (e.g., “detect smoke in an image with confidence ≥ 0.5”).
    • Complex requirements: compositions of simple ones using Boolean operators (AND, OR, NOT).
  2. Entities
    • Concrete entities: physical objects like “person” or “car.”
    • Abstract entities: inferred concepts like “night” or “empty house.”
    • Entities can have attributes (e.g., a car has model, color, license plate).
    • Attributes can be fixed or left empty to be filled dynamically during recognition.
  3. Modalities
    Each requirement specifies the modality (image, audio, text, etc.) to be used when evaluating the entity.
  4. Cardinalities
    Inspired by UML, MERLAN allows defining quantities: exactly one, ranges (e.g., [1..*]), etc.
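To make the metamodel concrete, here is a minimal sketch of how its main concepts could be modeled in Python. The class and field names below are our own illustration, not the actual metamodel implementation:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Modality(Enum):
    """Input modality evaluated by a requirement (illustrative subset)."""
    IMAGE = "image"
    AUDIO = "audio"
    TEXT = "text"

@dataclass
class Entity:
    """A concrete or abstract entity; abstract entities carry a description."""
    name: str
    abstract: bool = False
    description: Optional[str] = None
    # attribute name -> value; None means "fill dynamically during recognition"
    attributes: dict = field(default_factory=dict)

@dataclass
class SimpleRequirement:
    """A rule over a single entity, in one modality, with a confidence threshold."""
    entity: Entity
    modality: Modality
    confidence: float
    min_card: int = 1
    max_card: Optional[int] = None  # None = unbounded, as in [1..*]

@dataclass
class ComplexRequirement:
    """Boolean composition (AND / OR / NOT) of simpler requirements."""
    operator: str
    operands: list
```

With these classes, the “detect smoke with confidence ≥ 0.5” example becomes `SimpleRequirement(Entity("smoke"), Modality.IMAGE, 0.5)`.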

 

Concrete Syntax: The MERLAN Grammar

MERLAN’s concrete syntax is textual and implemented using ANTLR.

The grammar defines the syntactic rules of the MERLAN language, following the structure of the metamodel above. For instance, at the top level, the script rule states that a specification consists of entity definitions and requirement definitions.
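As a rough illustration of this structure (the rule names below are ours and will differ from the actual grammar), an ANTLR-style sketch of the top-level rules could look like:

```antlr
// Illustrative sketch only; not the actual MERLAN grammar
script        : entityBlock requirementBlock ;
entityBlock   : 'ENTITIES:' concreteBlock? abstractBlock? ;
concreteBlock : 'CONCRETE:' entityDef+ ;
abstractBlock : 'ABSTRACT:' entityDef+ ;
entityDef     : ID attribute* ;
attribute     : '-' ID ':' (STRING | NUMBER | '?') ;   // '?' = fill at recognition time
requirementBlock : 'REQUIREMENTS:' requirement+ ;
```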

[Figure: some of the main rules of the MERLAN grammar]

 

Based on this grammar, we could express the previous house agent example as follows:

ENTITIES:
CONCRETE:
person
 - gender: ?
 - ethnicity: ?
smoke
fire
car
 - model: ?
 - color: ?

ABSTRACT:
night
 - description: "The image is taken at night"
empty_house
 - description: "The house is empty"

REQUIREMENTS:
requirement1:
 CONCRETE
 - entity: smoke
 - name: "smoke"
 - modality: "image"
 - confidence: 0.5

requirement2:
 OR
  CONCRETE
   - entity: fire
   - name: "fire"
   - modality: "image"
   - confidence: 0.5
 AND
  ABSTRACT
   - entity: empty_house
   - name: "empty_house"
   - modality: "image"
   - confidence: 0.3
 OR
  CONCRETE [1..*]
   - entity: person
   - name: "unknown_person"
   - modality: "image"
   - confidence: 0.7
   - gender: "male"

The first code block, identified with the ENTITIES keyword, contains all the entity definitions following the grammar rules. In this example, some entities have no attributes (see smoke and fire), while others (person and car) declare attributes with empty values to be filled during recognition. The requirements block, under the REQUIREMENTS keyword, contains the two example requirements. Requirement2 is a composition of simple requirements, one of which declares a cardinality of [1..*] (i.e., at least one instance). The requirement referencing the person entity also shows how to fix an entity attribute’s value at the requirement level (see the gender: “male” attribute).
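The intended semantics of such a composed requirement (Boolean operators plus cardinalities and confidence thresholds) can be illustrated with a small evaluator. The detection format below, a list of (entity name, confidence) pairs, is our assumption for illustration, not part of MERLAN:

```python
# Minimal sketch: evaluate simple requirements over a list of detections,
# where each detection is an (entity_name, confidence) pair.

def eval_simple(detections, entity, threshold, min_card=1):
    """True if at least `min_card` detections of `entity` meet the threshold."""
    hits = [conf for name, conf in detections
            if name == entity and conf >= threshold]
    return len(hits) >= min_card

def eval_requirement2(detections):
    """fire OR (empty_house AND at-least-one person), mirroring the example."""
    fire = eval_simple(detections, "fire", 0.5)
    empty = eval_simple(detections, "empty_house", 0.3)
    person = eval_simple(detections, "person", 0.7, min_card=1)  # [1..*]
    return fire or (empty and person)
```

For example, `eval_requirement2([("fire", 0.9)])` holds, while a lone low-confidence person detection does not.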

Tool Support and Prototype

We have implemented and made available a prototype of MERLAN.

The prototype takes MERLAN specifications and generates Python code where requirements are mapped into agent triggers. The framework integrates LLMs and Computer Vision models to process multimodal inputs in real time.
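As a rough idea of what this mapping could look like, here is a hypothetical sketch of generated trigger code. The trigger API, frame format, and function names are our illustration; the code actually generated by the prototype may differ:

```python
# Hypothetical sketch: each requirement becomes a (condition, action) pair
# that is checked against every incoming multimodal "frame".

TRIGGERS = []

def trigger(condition):
    """Decorator registering an action to fire when condition(frame) is true."""
    def register(action):
        TRIGGERS.append((condition, action))
        return action
    return register

def smoke_detected(frame):
    # frame: dict mapping entity name -> confidence from the perception models
    return frame.get("smoke", 0.0) >= 0.5

@trigger(smoke_detected)
def notify_owner(frame):
    # In a real agent this would send a notification; here we just report it.
    return "notify_owner"

def process(frame):
    """Run all registered triggers against one input frame."""
    return [action(frame) for cond, action in TRIGGERS if cond(frame)]
```

Calling `process({"smoke": 0.7})` fires the notification action, while a frame below the confidence threshold fires nothing.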

This way, requirements engineers can move directly from formal multimodal specifications to a working prototype agent.

Conclusions and Research Roadmap

The current version of MERLAN is a first step in the precise definition and implementation of multimodal software systems, but there is still a lot to do:

  • Graphical notation – making requirements easier to model visually, even with examples (e.g., providing an image scenario).
  • Behavioral requirements – extending beyond conditions to specify multimodal system responses.
  • Temporal constraints – handling requirements involving time (e.g., an object must persist for 5 seconds before action).
  • Hierarchical modalities – deriving high-level modalities (gestures, emotions) from low-level ones (images, audio).
  • Quality analysis – detecting inconsistencies or conflicts in multimodal requirements.

We would be happy to discuss any of these directions with you.
