This post gathers the work we have done over the last three years on improving the scalability of existing model-driven tools, especially regarding the manipulation of very large models. It summarizes the content of my thesis, providing an overview of the concrete issues we have considered and the approaches we have proposed to tackle them. You can have a look at the online version of the manuscript, as well as the pointers to specific blog posts and articles provided for each solution.
Scalability for very large models: Problem Statement
While MDE was presented as the silver bullet for software engineering, the growing use of large and complex models in industrial contexts has clearly exposed serious limitations hampering its adoption. Empirical assessments from industrial companies adopting MDE have shown that the limited support for large model management in existing technical solutions is one of the main factors in the failure of industrial MDE processes.
Indeed, modeling tools were primarily designed to handle simple modeling activities, and existing technical solutions do not scale to the large models that are commonly used, and often automatically constructed, nowadays. As an example, the BIM metamodel defines a rich set of concepts (around 800) describing the different aspects of physical facilities and infrastructures. Instances of this metamodel are typically composed of millions of densely interconnected elements. The resulting models are stored in large monolithic files of several gigabytes, and cannot be processed efficiently by current modeling infrastructures.
In our work, we focus on two major issues that have to be addressed in order to improve the scalability of existing technical solutions and enable industrial usage of MDE techniques applied to large models.
Issue 1: Model Storage Scalability
In the last decade, file-based EXtensible Markup Language (XML) serialization has been the preferred format for storing and sharing models. While this format was a good fit for simple modeling activities such as human model sketching, it has shown clear limitations when applied to today's industrial use cases, which typically manipulate large, potentially automatically generated models. Indeed, XML-like representations usually rely on large monolithic files that must be entirely parsed to be navigable, and provide limited support for partial loading and unloading of model fragments.
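To make the limitation concrete, the snippet below shows roughly what loading an XMI-serialized model with the standard EMF API looks like (the file path is a placeholder): the whole document is parsed and materialized in memory before a single element can be navigated, which is exactly what breaks down on multi-gigabyte files.

```java
import org.eclipse.emf.common.util.URI;
import org.eclipse.emf.ecore.resource.Resource;
import org.eclipse.emf.ecore.resource.ResourceSet;
import org.eclipse.emf.ecore.resource.impl.ResourceSetImpl;
import org.eclipse.emf.ecore.xmi.impl.XMIResourceFactoryImpl;

public class XmiLoadSketch {
    public static void main(String[] args) {
        // The metamodel's EPackage is assumed to be registered elsewhere.
        ResourceSet rs = new ResourceSetImpl();
        // Register the default XMI factory for the .xmi extension.
        rs.getResourceFactoryRegistry().getExtensionToFactoryMap()
          .put("xmi", new XMIResourceFactoryImpl());
        // Loading eagerly parses the entire monolithic file into memory
        // before any element can be navigated.
        Resource resource = rs.getResource(URI.createFileURI("/path/to/model.xmi"), true);
        System.out.println(resource.getContents().size() + " root element(s) loaded");
    }
}
```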
Several solutions based on relational and NoSQL databases have been proposed to address this issue, but they often focus on providing generic scalability improvements (e.g. lazy-loading strategies), and the choice of the data-store is completely decoupled from the expected model usage. As a result, a given solution may fit a specific modeling scenario but be unsuited to another one. Furthermore, existing model persistence frameworks typically lack the advanced caching and prefetching mechanisms that could be integrated to improve their performance.
Issue 2: Model Query and Transformation Scalability
Model queries and transformations are the cornerstones of MDE processes, and multiple approaches have been designed to compute them on top of existing modeling platforms. Model query and transformation frameworks typically provide a high-level language (such as the Object Constraint Language (OCL) OMG standard [83]) that is translated into sequences of modeling framework API calls and computed on the underlying data-store. While this query computation technique is efficient on top of XML-based serialization platforms (because the entire model has to be loaded in memory anyway), it presents two major drawbacks when applied to current scalable persistence solutions: (i) the modeling framework APIs are not aligned with the query capabilities of the data-store, limiting its benefits, and (ii) a significant time and memory overhead is required to reify the intermediate objects that are manipulated through these APIs.
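The sketch below illustrates drawback (ii): a simple OCL selection over a hypothetical ClassDeclaration metaclass, computed through the generic EMF API. Every element traversed by the query is reified as an EObject before the filter is evaluated in memory, no matter how well the underlying data-store could have filtered it natively. This is a minimal, framework-agnostic illustration, not the code produced by any particular query engine.

```java
import java.util.ArrayList;
import java.util.List;

import org.eclipse.emf.common.util.TreeIterator;
import org.eclipse.emf.ecore.EObject;
import org.eclipse.emf.ecore.EStructuralFeature;
import org.eclipse.emf.ecore.resource.Resource;

public class ApiBasedOclSelect {

    // Rough equivalent of the OCL query
    //   ClassDeclaration.allInstances()->select(c | c.name.startsWith('Test'))
    // evaluated through the reflective EMF API.
    public static List<EObject> selectTestClasses(Resource resource) {
        List<EObject> result = new ArrayList<>();
        TreeIterator<EObject> it = resource.getAllContents(); // reifies every element it visits
        while (it.hasNext()) {
            EObject e = it.next();
            if (!"ClassDeclaration".equals(e.eClass().getName())) {
                continue; // the cost of loading this element was paid anyway
            }
            EStructuralFeature nameFeature = e.eClass().getEStructuralFeature("name");
            Object name = nameFeature == null ? null : e.eGet(nameFeature);
            if (name instanceof String && ((String) name).startsWith("Test")) {
                result.add(e);
            }
        }
        return result;
    }
}
```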
In addition, current query and transformation solutions typically store additional information in memory (such as transformation traces) that grows with the model size and limits their performance when applied to large models.
Our contributions towards solving the performance and scalability issues of modeling tools
In order to solve these issues, we have proposed the novel scalable modeling ecosystem presented in Figure 1. It is composed of four components, each one addressing a specific issue.
- NeoEMF is our solution to improve the storage and access of large models. It is defined as a generic modeling framework that can be transparently plugged into the EMF platform, and provides a set of NoSQL database implementations that can be selected to suit a given modeling activity. NeoEMF is based on a modular architecture that can be complemented with additional model storage techniques, and provides extension mechanisms that are reused in the other contributions to further improve the performance of model query and transformation computations. (blog post, article)
- PrefetchML is a DSL that allows modelers to define prefetching and caching instructions over a model. The resulting PrefetchML plan is processed by an execution engine that takes care of loading and unloading elements, speeding up model accesses and query computation. Our approach aims to be generic and can be applied to any persistence solution that provides an EMF-compatible interface, and an advanced integration in NeoEMF has been proposed to further improve performance. (blog post, article)
- Mogwaï is a novel model query approach that generates NoSQL database instructions from high-level model queries expressed in OCL (an illustrative sketch of such a translation is given right after this list). Our solution relies on the rich database query languages provided by NoSQL databases (in particular graph implementations) to bypass the limitations of the modeling stack. Mogwaï is natively integrated in NeoEMF, and our experiments show a significant improvement in terms of execution time and memory consumption compared to state-of-the-art solutions. (blog post, article)
- Gremlin-ATL is an extension of the Mogwaï approach that supports model transformations expressed in ATL. Our approach embeds a novel transformation execution engine that can be parameterized to scale to large models by storing transformation information in a dedicated data-store, and provides a set of low-level connectors that allow computing transformations on heterogeneous data sources. (blog post, article)
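To give an intuition of the Mogwaï translation mentioned above, here is a hedged sketch of how an OCL query could be mapped to a single Gremlin traversal evaluated inside the graph store. The property names ("kind", "name") and the use of the TinkerPop 3 Java API are assumptions made for this example; the actual element-to-vertex mapping and Gremlin dialect used by Mogwaï and NeoEMF differ in their details.

```java
import java.util.List;

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class OclToGremlinSketch {

    // Hypothetical translation of the OCL query
    //   ClassDeclaration.allInstances()->select(c | c.name = 'Order')
    // into one traversal: the filtering runs inside the database, so no
    // intermediate EObject has to be reified on the modeling side.
    public static List<Vertex> findOrderClasses(GraphTraversalSource g) {
        return g.V()
                .has("kind", "ClassDeclaration") // assumed property encoding the metaclass
                .has("name", "Order")
                .toList();
    }

    public static void main(String[] args) {
        // Tiny in-memory graph standing in for the persistent NeoEMF store.
        TinkerGraph graph = TinkerGraph.open();
        graph.addVertex("kind", "ClassDeclaration", "name", "Order");
        graph.addVertex("kind", "ClassDeclaration", "name", "Customer");
        System.out.println(findOrderClasses(graph.traversal()).size()); // prints 1
    }
}
```

Compared to the API-based evaluation sketched earlier, the whole selection is delegated to the data-store, which is essentially where the gains in execution time and memory consumption come from.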
In addition, we have reused and combined the core MDE techniques introduced in this thesis into a solution dedicated to bridging the gap between conceptual modeling and NoSQL (especially graph databases) infrastructures. UmlToGraphDB (blog post, article) is a model-driven approach that translates conceptual schemas expressed using UML into a graph representation, and generates database-level queries from business rules and invariants defined using OCL. Our approach reuses the implicit model to database mapping embedded in NeoEMF, and integrates the Mogwaï framework to generate database queries ensuring data integrity.
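As an illustration of the last point, an OCL invariant such as context Employee inv positiveSalary: self.salary > 0 could be checked directly at the database level by retrieving the violating vertices, without loading the whole model in memory. The sketch below rests on assumed conventions (one vertex label per UML class, one vertex property per attribute, TinkerPop 3 API) and is not the exact code generated by UmlToGraphDB.

```java
import java.util.List;

import org.apache.tinkerpop.gremlin.process.traversal.P;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class InvariantCheckSketch {

    // Hypothetical database-level check for the OCL invariant
    //   context Employee inv positiveSalary: self.salary > 0
    // Returns the vertices violating the rule so that integrity issues can be
    // reported (or fixed) directly on the data-store side.
    public static List<Vertex> positiveSalaryViolations(GraphTraversalSource g) {
        return g.V()
                .hasLabel("Employee")           // assumed: one vertex label per UML class
                .not(__.has("salary", P.gt(0))) // keep only the vertices breaking the invariant
                .toList();
    }
}
```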
I am a postdoctoral fellow in the Naomod research team and co-founder of Xatkit. My research interests are mostly focused on foundations of Model-Driven Engineering, such as model persistence, querying, and transformation, but also on the application and democratization of MDE techniques to concrete domains.
Hi Gwendal,
So far I have cross-read your thesis and have to say: damn good work 😉
So far I did not find any hint on concurrent access to the model. In my experience, especially big models tend to be created and used by a team and not by a single individual. In a project I was part of, some 800 people worked on a single EEPC (enhanced event-driven process chain) model using the rDBMS-based Software AG ARIS modeling tool.
/Carsten
Hi Carsten,
I am glad you found our work interesting! Thanks a lot!
Indeed, concurrent model creation, access, and queries are still to be addressed. We wanted to have a set of working, single-user approaches first, on which we could add additional features such as concurrent accesses, access control policies, etc.
Still, there are some features, especially in NeoEMF, that could be reused to improve this. For example, the map backend is very efficient at supporting concurrent read/write (we partially tested this with PrefetchML, where the prefetcher accesses the model at the same time the main application performs a query). The column backend is also designed to be accessed concurrently, but does not provide feedback at the modeling level in case of consistency issues. On the graph side, the current version of the framework relies on an embedded Neo4j server, and thus cannot be accessed concurrently. However, Neo4j fully supports transactions, and moving the current implementation to a remote database would allow concurrent access.
Still, implementing concurrent model creation/access requires additional layers to check that the model stays consistent, detect and resolve conflicts, and provide useful feedback at the modeling level. Thankfully, there are existing tools in the EMF ecosystem that could be reused to implement that, such as EMF Compare or the versioning API of CDO.
I hope this answers your comment 🙂
Gwendal
Hi Gwendal,
based on my 20+ years of experience with rather big systems I can state: the chance that the model gets corrupted by concurrent access is extremely small. The reason simply is that in a team setup, team members have almost disjoint duties. In the rare case a conflict occurs, it can be handled manually. Well yes, for a small period of time you have an inconsistent model. But all architects I have worked with so far were aware of and accepted that rare case.
So I would strongly advise you NOT to use transactions to perform model changes. Transactions do not scale well. Simply use a last-write-wins approach. Last write wins has been used by SCM (source code management) systems for decades and is well accepted among both developers and architects.
BTW, your Christian name Gwendal translates, if I remember correctly, to “The White One”. Seems very plausible to me 😉
/Carsten
Hi Carsten,
Thanks for this useful feedback! It is nice to hear that sometimes industrial use cases do not require such advanced features, and that the consistency issues can be handled manually 🙂
However, I still believe that designers should have some tooling to detect inconsistencies in order to easily fix them, even if we do not use transactions to support concurrent model management. Using a simple last-write-wins approach without at least a notification mechanism may create some tricky states where a model is not consistent (or even worse, syntactically correct but not valid with regard to the system to model) and nobody is aware of it.
Anyway, implementing such a mechanism is way easier than the full-consistency approach backed by database transactions! This could definitely be a good first step towards supporting concurrent model creation and manipulation.
Btw you are right about the translation; however, it is a Celtic name, not a Christian one 😉
Gwendal
Hi Gwendal,
I guess I have to explain how I tend to organise models. One extremely basic design principle, and the oldest I am aware of, is “divide et impera”. While that principle is at least 5000 years old, it is still my highest-priority design principle. Given that, I recursively divide (if I have the freedom to do so) the whole model into submodels. A modeler gets his/her own model he/she is responsible and accountable for in the sense of RACI (https://en.wikipedia.org/wiki/Responsibility_assignment_matrix). If a modeler finishes a model, he/she gets a new one. Each of these models, including the root model, has a public interface part (= submodel) and a private specification part (= submodel). The models strictly follow the SOLID principles (https://en.wikipedia.org/wiki/SOLID).
Given that, the model is a system of models and can be persisted in a distributed way at design time. The big advantage is that modelers can work offline wherever they want to work. Design includes creating the models, validating the models, reviewing the models, and releasing the models. Consequently, in this phase there is no need to handle big models.
But as soon as validations or transformations have to be performed on the complete model tree, the model to be handled gets rather big and has to exist as a single entity.
Currently I use git (https://git-scm.com/) to store the models whenever possible. The git SCM features local repositories (= offline support), server-based repositories (= centralized storage) and staging (= a measure to implement quality gates). Besides this, EGit neatly integrates into Eclipse. As a result, the persistence part is, at least for me, already perfectly solved.
What remains is the validation and transformations on the entire model tree as a self-contained single entity. And that part is not yet solved for me.
/Carsten
Thanks for the good work.
Neo4j graphs are possibly the best store in town for the job.
Yet neither the store nor other “magic” (Gemfire distributed caches, Gemstone distributed Smalltalk, …) can really allow multiple users to change the model freely, at large and in general.
That leaves us with two choices:
– either keep all users “on the same page”, synchronising their views on the model as fast as other users make changes (hopefully each user actually (lazily) reads only a small chunk of the whole model), which is akin to real-time collaborative modelling;
– or impose a discipline on the scope each modeller is able to change, and freeze all the others for him; yet this would still break models, as e.g. OCL expressions may, through navigation, refer to model features which would get broken on commit.
Each “iteration” would begin with a decomposition of the models into “touchable” and “untouchable” parts for “individual modelling”,
followed by “individual modelling” work,
and then by a “synthesis” unit of work, where the various deltas on “individual modelling” fragments are put together,
with special emphasis on semantic coherence, reuse and duplication avoidance.
RationalRose models of dozens of units, spanning from requirements to entity and precise activity modelling, with millions of individual model file lines (RRose models were stored as text), were successfully managed this way (breakdown-model-synthesise) while avoiding model corruption.
Before this practice, “model sanitisation” scripts had to be run often on parts and on the whole model, just to remove broken references between units.
So, in short, divide and conquer as very well put by Carsten Pitz.
BTW: with RRose, scripting was heavily used, both while “decomposing”, to prepare “copies” (model elements and diagrams) of existing models in preparation for new units of work, and while “synthesising”, to create new copies from the previous work and the results of individual modelling.
Scripts injected custom properties into model elements, including identifiers of the “original” element that the “copies” represent in the original, individual, or synthesis model artefacts, and the history of partial modelling efforts that had been synthesised into each “final” model element.
Yes, all the above made the models even bigger between iterations (one can clean up at the iteration end), but hey! since we had to deal with the problem anyway, we just turned it into an advantage.
We were working with ClearCase, and making good use of its unique abilities to present a filesystem
finely tuned to expose to the modeller specific versions of units, and permissions on them.
Not all geometry in diagrams could be precisely replicated. Fortunately, the most noticeable handicap was the lines manually routed in object instance diagrams, which we used extensively in analysis and for communication of precise cases, but which were not part of the “synthesis” effort, which was restricted to classifier modelling.
… then again, even if the tooling was UML1.x, we adopted UML2.x activity modelling and just made do with the class tool and some scripting (after all, UML2.x activities are also classifiers, are they not?)
> Not all geometry in diagrams could be precisely replicated.
Exactly!!! And that is also the very reason why Eclipse Papyrus warns you if you share diagrams 😉
Yep, just to be sure I was not imagining things, I put on my archaeologist hat, dove into old storage (not 1/2″ tapes, really!) and found the model:
53 unit files, 1,758,817 lines (~= model properties plus Ids & model element comments)
In an Indiana Jones move, I was able to resurrect RationalRose 2002 on a Windows Vista VM and load the model: it runs like hell on my (today’s) souped-up Intel i7.
RUP Requirements Analysis Model / System Use Case Realisations with precise activity modeling ‘a-la-UML2.x’ with UML1.x tooling (RRose 2002)
Hi Antonio, thanks for your comment!
I agree with the “divide and conquer” approach; it is definitely a good solution to handle large models, as long as you are able to create model fragments that are small enough to be manipulated efficiently.
In our work we consider a particular scenario of model-driven reverse engineering, where a model is extracted from an existing code base. The process creates a “Java Model” that represents the application at a low level of abstraction (methods, blocks, variables, etc.). Even if the extraction process partially applies the divide-and-conquer approach, we often end up with models containing millions of elements that are hard to handle in current modeling frameworks. Instead of fragmenting the resulting model (using tools such as EMF-Fragment for example), we chose to use NoSQL databases, keeping in mind that an advanced storage mechanism would probably be interesting when computing queries and transformations on the entire model (such as the model validations and transformations emphasized by Carsten). This is the approach we selected to develop the Mogwaï and Gremlin-ATL frameworks.
Finally, I think that using Git (or any general version control system) to store models is clearly a good solution, as long as you don’t get conflicts. In that case the VCS cannot provide useful hints to help resolve the conflict, and you basically have to do it by hand. A model-based approach like the versioning feature embedded in CDO looks like a better approach to me, but is probably harder to integrate into standard development processes.