After the release of powerful Large Language Models (LLMs) such as Copilot and ChatGPT, many of us have wondered whether they are good not only at coding, but also at modeling. A good number of engineers believe that LLMs are here to stay and that they are going to revolutionize the way software is developed, but: to what extent can LLMs be used today to model software systems? While there are many papers devoted to analyzing the potential advantages and limitations of these generative AI models for writing code, the current state of LLMs with respect to software modeling has received little attention so far.

In a recent paper [1] published open access in the Software and Systems Modeling journal, we have investigated the current capabilities of ChatGPT to perform modeling tasks and to assist modelers, while also trying to identify its main shortcomings.

TL;DR: Our findings show that, in contrast to code generation, the performance of the current version of ChatGPT for software modeling is limited, with various syntactic and semantic deficiencies, a lack of consistency in its responses, and scalability issues. We believe we should start working now to improve LLMs for software modeling.


Most experts foresee a major disruption in the way software is developed, and software engineering education is also expected to change drastically with the advent of LLMs. These issues are a recurrent topic in many universities and are covered by most specialized forums and blogs. A plethora of papers are now analyzing the potential advantages, limitations and failures of these models for writing code, as well as how programmers interact with them. Most studies seem to agree that LLMs do an excellent job of writing code: despite some minor syntactical errors, what they produce is essentially correct.

However, what about software modeling? What is the situation of LLMs when it comes to performing modeling tasks or assisting modelers to accomplish them? A few months ago, together with my colleagues Javier Cámara, Javier Troya and Antonio Vallecillo, we started looking at these issues, trying to investigate the current status of LLMs with respect to conceptual modeling. Our premise is that LLMs are here to stay. So, instead of ignoring them or rejecting their use, we posit that it would be better to embrace and use them in an effective manner to help us perform modeling tasks.

In our paper [1], we focused on how to build UML class diagrams enriched with OCL constraints using ChatGPT. To do so, we investigated several issues, such as:

  1. the correctness of the produced models;
  2. the best way to ask ChatGPT to build correct and complete software models;
  3. its coverage of different modeling concepts and mechanisms;
  4. its expressiveness and cross-modeling language translation capabilities; and
  5. its sensitivity to context and problem domains.


We conducted an experiment to understand the current capabilities of ChatGPT to perform modeling tasks. We defined two phases: (1) some exploratory tests to gain a basic understanding of how ChatGPT works with software models, as well as its main features and limitations; and (2) more systematic tests that aimed to further characterize ChatGPT’s modeling capabilities. All the details of the experiments, the materials and examples, and all our findings are listed in our paper.

We formulated eight research questions that we wanted to answer. They are presented below.

Research questions and answers

RQ1.: Does ChatGPT generate syntactically correct UML models?

The UML models produced by ChatGPT are generally correct, although they may contain small syntactic errors, which depend on the notation used (PlantUML, USE, plain text, etc.). Although we did not test it thoroughly, the level of syntactic correctness of the models produced in PlantUML was much higher than that of the models generated in USE, for example.
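For readers unfamiliar with these notations: PlantUML describes class diagrams textually, so small syntactic slips (a malformed arrow, a missing quote around a multiplicity) are easy to spot and fix. A minimal, hand-written sketch of the kind of output discussed here — the domain and names are our own illustration, not taken from the experiments in the paper:

```plantuml
@startuml
' Illustrative example only; not from the paper's experiments.
class Customer {
  name : String
}
class Order {
  date : Date
  total() : Real
}
' One customer places zero or more orders.
Customer "1" -- "0..*" Order : places
@enduml
```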

RQ2.: Does ChatGPT generate semantically correct models, i.e., semantically aligned with the user’s intent?

This is the weakest point that we observed during our interaction with ChatGPT. Some studies suggest that LLMs are better at syntax than at producing semantically correct results, and our findings corroborate this. The errors concern both the semantics of the modeling language and the semantics of the domain being modeled. On many occasions, we observed that ChatGPT proposed seemingly random models that made no sense from either a modeling or a domain standpoint.

RQ3.: How sensitive is ChatGPT to the context and to the problem domain?

Our findings show that not only does the problem domain influence the resulting models, but so does the information exchanged during the dialogues with ChatGPT. In addition, the more ChatGPT “knows” about a domain (i.e., the more data about that domain was used during training), the closer to correct the class models it produces are. ChatGPT produces its worst results when it has little or no information about the domain or the entities to be modeled, as happened when it was asked to produce software models of entities such as Snarks or Zumbats, for which it did not seem to have any reference or semantic anchor.

RQ4.: How large are the models that ChatGPT is able to generate or handle?

ChatGPT currently has strict limitations on the size of the models it can handle. It has serious problems with models larger than 10–12 classes. Even the time and effort required to produce smaller models are not insignificant.

RQ5.: Which modeling concepts and mechanisms is ChatGPT able to effectively use?

We analyzed 16 modeling concepts, some as simple as classes and attributes, and others more advanced, such as OCL constraints or association classes. We observed a high degree of variability in how ChatGPT handles them. ChatGPT manages associations, aggregations and compositions, simple inheritance, and role names of association ends reasonably well (with some exceptions). However, it requires explicit instructions to use enumerations, multiple inheritance and integrity constraints. Finally, we found that its results are not acceptable when using abstraction, and that it cannot handle association classes.
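To make these mechanisms concrete, here is a small hand-written PlantUML sketch (the library domain and all names are our own illustration, not from the paper's experiments) combining several of them: an enumeration, a composition, role names with multiplicities, an association class, and an OCL invariant attached as a note:

```plantuml
@startuml
' Illustrative example only; not from the paper's experiments.
enum BookStatus {
  AVAILABLE
  ON_LOAN
}

class Library {
  name : String
}

class Book {
  title : String
  status : BookStatus
}

class Member {
  maxLoans : Integer
}

' Composition: every Book belongs to exactly one Library.
Library "1" *-- "0..*" Book : holdings

' Plain association with multiplicities.
Member "0..1" -- "0..*" Book : borrows

' Association class capturing attributes of the loan itself.
(Member, Book) .. Loan
class Loan {
  dueDate : Date
}

' OCL invariant, attached as a note.
note bottom of Member
  context Member
  inv MaxLoans: self.book->size() <= self.maxLoans
end note
@enduml
```

In our experience, the simple constructs in the first half of this sketch are the kind ChatGPT handles reasonably well, while the association class and the OCL invariant are the kind that require explicit prompting or fail outright.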

RQ6.: Does prompt variability impact the correctness/quality of the generated models?

We observed plenty of variability in the responses ChatGPT generates for the same prompt. We learned that, when the results were not good, it is useful to start a new conversation from scratch in order to find better solutions for the same intended model.

RQ7.: Do different use strategies (e.g., prompt partitioning) result in different outcomes?

First, the limited size of the models that ChatGPT is capable of handling in a single query forces the modeling task to become an iterative process, in which the user starts with a small model and progressively adds details to it. Second, the variability and randomness of ChatGPT's responses, or the divergence of results within a long conversation, often force the modeler to restart conversations to try to obtain better models.

RQ8.: How sensitive is ChatGPT to the UML notation used to represent the output models?

ChatGPT is capable of representing models in several notations, although in general it makes fewer syntactic mistakes with PlantUML. It is also much better with OCL than with UML. Finally, we looked at how accurately ChatGPT translated models across notations, and found that this task works better within the same conversation than across conversations.
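As an illustration of how different these notations are, the same tiny model written by hand in USE's textual format (again our own example, assuming the standard USE model-file syntax) looks quite unlike its PlantUML counterpart, which may explain part of the gap in syntactic correctness:

```
-- Illustrative example only; not from the paper's experiments.
model Shop

class Customer
attributes
  name : String
end

class Order
attributes
  date : String
end

association Places between
  Customer[1] role customer
  Order[0..*] role orders
end

constraints

context Customer
  inv NonEmptyName: self.name.size() > 0
```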

Discussion and Conclusions

From our study, we conclude that ChatGPT is not yet a reliable tool for performing modeling tasks. Does that mean we should discard it, or at least wait and see how it evolves before taking any action? Our position is that, on the contrary, we should start working now to strengthen the modeling skills of ChatGPT and the LLMs to come (including their understanding of modeling languages beyond UML), and to build a future in which these assistants are destined to play a prominent role in modeling.

In our opinion, ChatGPT or any other LLM can be of invaluable help in many areas of MBSE, complementing the current work of software modelers and letting them focus on the tasks for which they really provide value.

The use of large language models has the potential to revolutionize software modeling and modeling education, making them more accessible, personalized and efficient. To get to that point, we will first need to improve the consistency and reliability of the models currently produced by LLMs such as ChatGPT. Second, we will need to change the way in which we currently develop software models and teach modeling.

We encourage you to read our paper for more details, and to contact us if you would like to discuss!


[1] Javier Cámara, Javier Troya, Lola Burgueño, Antonio Vallecillo. On the assessment of generative AI in modeling tasks: an experience report with ChatGPT and UML. Softw Syst Model (2023). DOI: 10.1007/s10270-023-01105-5
