This post gathers the work we have done over the last three years to improve the scalability of existing model-driven tools, especially for the manipulation of very large models. It summarizes the content of my thesis, giving an overview of the concrete issues we have considered and the approaches we have proposed to tackle them. You can have a look at the online version of the manuscript, as well as the pointers to specific blog posts and articles provided for each solution.
Scalability for very large models: Problem Statement
While MDE was claimed to be the silver bullet for software engineering, the growing use of large and complex models in industrial contexts has exposed serious limitations that hamper its adoption. Empirical assessments from industrial companies adopting MDE have shown that the limited support for large model management in existing technical solutions is one of the main factors in the failure of industrial MDE processes.
Indeed, modeling tools were primarily designed to handle simple modeling activities, and existing technical solutions are not designed to scale to the large, often automatically constructed models that are common today. As an example, the BIM metamodel defines a rich set of concepts (around 800) that describe the different aspects of physical facilities and infrastructures. Instances of this metamodel are typically composed of millions of densely interconnected elements. The resulting models are stored in large monolithic files of several gigabytes, and cannot be processed efficiently by current modeling infrastructures.
In our work, we focus on two major issues that have to be addressed in order to improve the scalability of existing technical solutions and enable industrial usage of MDE techniques applied to large models.
Issue 1: Model Storage Scalability
In the last decade, file-based EXtensible Markup Language (XML) serialization has been the preferred format for storing and sharing models. While this format was a good fit for simple modeling activities such as manual model sketching, it has shown clear limitations when applied to today's industrial use cases, which typically manipulate large, potentially automatically generated models. Indeed, XML-like representations usually rely on large monolithic files that must be parsed entirely before they can be navigated, and provide limited support for partial loading and unloading of model fragments.
Several solutions based on relational and NoSQL databases have been proposed to address this issue, but they often focus on providing generic scalability improvements (e.g. lazy-loading strategies), and the choice of the data-store is entirely decoupled from the expected model usage. As a result, a given solution can fit one modeling scenario and be ill-suited to another. Furthermore, existing model persistence frameworks typically lack advanced caching and prefetching mechanisms that could be integrated to improve their performance.
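To make the lazy-loading idea concrete, here is a minimal Python sketch. It is not taken from any of the tools discussed in this post: the `BACKEND` dictionary stands in for a database, and the `LazyResource` and `LazyElement` classes are hypothetical names introduced for illustration only. The point is that an element is read from the backend the first time it is navigated to, instead of the whole model being parsed up front.

```python
from typing import Dict, List

# Hypothetical backend: element id -> stored record (stands in for a database).
BACKEND: Dict[str, dict] = {
    "building": {"name": "Building A", "children": ["floor1", "floor2"]},
    "floor1": {"name": "Floor 1", "children": ["wall1"]},
    "floor2": {"name": "Floor 2", "children": []},
    "wall1": {"name": "Wall 1", "children": []},
}


class LazyResource:
    """Loads elements from the backend on first access and keeps them in memory."""

    def __init__(self, backend: Dict[str, dict]) -> None:
        self._backend = backend
        self._loaded: Dict[str, "LazyElement"] = {}
        self.fetch_count = 0  # number of backend reads, tracked for illustration

    def get(self, element_id: str) -> "LazyElement":
        if element_id not in self._loaded:
            self.fetch_count += 1
            self._loaded[element_id] = LazyElement(self, self._backend[element_id])
        return self._loaded[element_id]

    def is_loaded(self, element_id: str) -> bool:
        return element_id in self._loaded

    def unload(self, element_id: str) -> None:
        """Drop an element from memory; it is fetched again on the next access."""
        self._loaded.pop(element_id, None)


class LazyElement:
    def __init__(self, resource: LazyResource, record: dict) -> None:
        self._resource = resource
        self.name: str = record["name"]
        self._child_ids: List[str] = list(record["children"])

    @property
    def children(self) -> List["LazyElement"]:
        # Referenced elements are resolved only when the reference is navigated.
        return [self._resource.get(cid) for cid in self._child_ids]


resource = LazyResource(BACKEND)
root = resource.get("building")   # reads one record
first_floor = root.children[0]    # navigating the reference reads both floors
```

After these two navigation steps, the wall contained in the first floor has still not been read from the backend. A real persistence framework would add the caching, unloading policy and prefetching discussed above on top of this basic mechanism.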
Issue 2: Model Query and Transformation Scalability
Model queries and transformations are the cornerstones of MDE processes, and multiple approaches have been designed to compute them on top of existing modeling platforms. Model query and transformation frameworks typically provide a high-level language (such as the Object Constraint Language (OCL), an OMG standard) that is translated into sequences of modeling framework API calls and computed by the underlying data-store. While this query computation technique is efficient on top of XML-based serialization platforms (because the entire model has to be loaded in memory anyway), it presents two major drawbacks when applied to current scalable persistence solutions: (i) the modeling framework APIs are not aligned with the query capabilities of the data-store, which limits its benefits, and (ii) a significant time and memory overhead is required to reify the intermediate objects that are manipulated through these APIs.
In addition, current query and transformation solutions typically store additional information in memory (such as transformation traces), which grows with the model size and limits their performance when applied to large models.
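The two drawbacks can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: an in-memory `sqlite3` database stands in for a scalable data-store (it is relational, not one of the NoSQL stores discussed in this post), and the `Element` class and the `element` table are hypothetical. The query is the equivalent of the OCL expression `Wall.allInstances()->select(w | w.height > 3)`.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Element:
    """In-memory object reified from a stored row (hypothetical model element)."""
    id: int
    type: str
    height: float


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE element (id INTEGER PRIMARY KEY, type TEXT, height REAL)")
conn.executemany(
    "INSERT INTO element (id, type, height) VALUES (?, ?, ?)",
    [(1, "Wall", 2.5), (2, "Wall", 4.0), (3, "Door", 2.0), (4, "Wall", 3.5)],
)


def query_via_api(connection):
    """Mimics API-based evaluation: reify every element, then filter in memory."""
    rows = connection.execute("SELECT id, type, height FROM element")
    reified = [Element(*row) for row in rows]
    matches = sorted(e.id for e in reified if e.type == "Wall" and e.height > 3)
    return matches, len(reified)


def query_via_store(connection):
    """Delegates the whole query to the data-store; only matching ids come back."""
    rows = connection.execute(
        "SELECT id FROM element WHERE type = ? AND height > ? ORDER BY id",
        ("Wall", 3),
    ).fetchall()
    return [row[0] for row in rows], 0  # no intermediate Element is created


api_result, api_reified = query_via_api(conn)
store_result, store_reified = query_via_store(conn)
```

Both functions return the same identifiers, but the first one creates an intermediate object for every stored element, while the second one lets the data-store do the filtering and creates none.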
Our contributions towards solving the performance and scalability issues of modeling tools
In order to solve these issues, we have proposed the novel scalable modeling ecosystem presented in Figure 1. It is composed of four components, each one addressing a specific issue.
- NeoEMF is our solution to improve the storage and access of large models. It is defined as a generic modeling framework that can be transparently plugged into the EMF platform, and provides a set of NoSQL database implementations that can be selected to suit a given modeling activity. NeoEMF is based on a modular architecture that can be complemented with additional model storage techniques, and provides extension mechanisms that are reused in the other contributions to further improve the performance of model query and transformation computations. (blog post, article)
- PrefetchML is a DSL that allows modelers to define prefetching and caching instructions over a model. The resulting PrefetchML plan is processed by an execution engine that takes care of loading and unloading elements, speeding up model accesses and query computation. Our approach aims to be generic, and can be applied to any persistence solution that provides an EMF-compatible interface, and an advanced integration in NeoEMF has been proposed to further improve performance. (blog post, article)
- Mogwaï is a novel model query approach that generates NoSQL database instructions from high-level model queries expressed in OCL. Our solution relies on the rich query languages provided by NoSQL databases (in particular graph implementations) to bypass the limitations of the modeling stack. Mogwaï is natively integrated in NeoEMF, and our experiments show a significant improvement in terms of execution time and memory consumption when compared to state-of-the-art solutions. (blog post, article)
- Gremlin-ATL is an extension of the Mogwaï approach that supports model transformations expressed in ATL. Our approach embeds a novel transformation execution engine that can be parameterized to scale to large models by storing transformation information in a dedicated data-store, and provides a set of low-level connectors that allow computing transformations on heterogeneous data-sources. (blog post, article)
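The idea of parameterizing a transformation engine with the place where its trace is stored can be sketched as follows. This is only an illustration under stated assumptions, not the Gremlin-ATL implementation: the `TraceStore` interface, the `InMemoryTrace` and `SqliteTrace` classes and the toy `transform` function are hypothetical, and `sqlite3` is used as a stand-in for a dedicated data-store.

```python
import sqlite3
from typing import Dict, List, Optional, Protocol


class TraceStore(Protocol):
    """Maps a source element id to the id of the target element created from it."""

    def put(self, source_id: str, target_id: str) -> None: ...

    def get(self, source_id: str) -> Optional[str]: ...


class InMemoryTrace:
    """Usual strategy: the trace lives on the heap and grows with the model."""

    def __init__(self) -> None:
        self._links: Dict[str, str] = {}

    def put(self, source_id: str, target_id: str) -> None:
        self._links[source_id] = target_id

    def get(self, source_id: str) -> Optional[str]:
        return self._links.get(source_id)


class SqliteTrace:
    """Alternative strategy: trace links are kept in a database.

    With the default ":memory:" path nothing is written to disk; passing a
    file path instead keeps the trace out of the heap.
    """

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS trace (source TEXT PRIMARY KEY, target TEXT)"
        )

    def put(self, source_id: str, target_id: str) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO trace (source, target) VALUES (?, ?)",
            (source_id, target_id),
        )

    def get(self, source_id: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT target FROM trace WHERE source = ?", (source_id,)
        ).fetchone()
        return row[0] if row else None


def transform(source_ids: List[str], trace: TraceStore) -> List[str]:
    """Toy one-to-one rule: each source element yields exactly one target element."""
    targets = []
    for source_id in source_ids:
        target_id = f"target_of_{source_id}"
        trace.put(source_id, target_id)  # record the link for later resolution
        targets.append(target_id)
    return targets


sources = ["class_A", "class_B"]
heap_trace, db_trace = InMemoryTrace(), SqliteTrace()
out_heap = transform(sources, heap_trace)
out_db = transform(sources, db_trace)
```

The transformation logic is identical in both runs; only the trace strategy passed as a parameter changes, which is what lets the memory cost of the trace be moved out of the engine when models get large.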
In addition, we have reused and combined the core MDE techniques introduced in this thesis into a solution dedicated to bridging the gap between conceptual modeling and NoSQL infrastructures (especially graph databases). UmlToGraphDB (blog post, article) is a model-driven approach that translates conceptual schemas expressed in UML into a graph representation, and generates database-level queries from business rules and invariants defined in OCL. Our approach reuses the implicit model-to-database mapping embedded in NeoEMF, and integrates the Mogwaï framework to generate database queries that ensure data integrity.