In this post we present our latest updates on NeoEMF, our solution for storing and accessing large models in a scalable way.
NeoEMF is a multi-database model persistence solution that can store models in several kinds of NoSQL datastores. The framework is fully compatible with the EMF API and lets users switch from one datastore to another using EMF’s Resource options. Users can thus choose the datastore that best fits their needs for a specific modeling task, and change to another later on if needed.
NeoEMF is based on a lazy-loading approach that brings into memory model elements only when they are accessed. The current version of the framework embeds three datastores:
- NeoEMF/Graph: persists models in a graph database to take advantage of the rich traversal language it provides (shortest path computation, complex navigation paths among several vertices/edges)
- NeoEMF/Map: persists models in a collection of in-memory/on-disk maps that provide fast access to atomic operations
- NeoEMF/Column: persists models in a distributed column-based datastore, making it possible to distribute intensive read/write workloads across datanodes
This version of NeoEMF will be presented in the demo session of the Models’16 conference. You can find the complete article here, or continue reading below. For more information about the framework itself you can visit the NeoEMF website and the project wiki on GitHub, which gives some technical details on how to use NeoEMF. We’ll be happy to discuss our approach in the comments! (Have you ever used NeoEMF? Which datastore would you like to add? For which scenario?)
Introduction to NeoEMF
With the progressive adoption of MDE techniques in industry, existing model persistence solutions have to address scalability issues to store, query, and transform large and complex models. Indeed, existing modeling frameworks were first designed to handle simple modeling activities, and often rely on XMI-based serialization to store models. While this format is a good fit for small models, it has shown clear limitations when scaling to large ones.
To overcome these limitations, several persistence frameworks based on relational and NoSQL databases have been proposed [5, 7, 11]. They rely on a lazy-loading mechanism, which reduces memory consumption by loading only accessed objects. These solutions have proven their efficiency compared to state-of-the-art tools, but they are often tailored to a specific datastore implementation.
In these approaches, the choice of the datastore is totally decoupled from the expected model usage (for example complex querying, interactive editing, or complex model-to-model transformation): the persistence layer offers generic scalability improvements, but is not optimized for a specific scenario. For example, a graph-based representation of a model can improve scalability by offering a lazy-loading mechanism, but will have poor execution time performance in scenarios involving repeated atomic value accesses.
Our previous work on model persistence has shown that providing a well-suited data store for a specific modeling scenario can dramatically improve performance. Based on this observation, we present in this article NeoEMF, a scalable model persistence framework based on a modular architecture enabling model storage into multiple data stores. Currently, NeoEMF provides three implementations (map, graph, and column), each one optimized for a specific usage scenario. NeoEMF provides two APIs: one strictly compatible with the Eclipse Modeling Framework (EMF) API, easing its integration into existing modeling tools, and an advanced API that provides specific features that bypass the standard EMF API to further improve the scalability of particular modeling scenarios.
The rest of the paper is organized as follows: Section 2 presents an overview of the NeoEMF architecture, and Sections 3 and 4 present the core features of the framework and the different datastores. Section 5 provides insights on the framework’s implementation, and finally Section 6 summarizes the key points of the paper and presents our future work. Note that examples of NeoEMF usage are provided on NeoEMF’s wiki and in a demonstration video available online. The video highlights NeoEMF’s core features such as model import, API usage, and the lazy model editor, which makes it possible to navigate large models interactively with a low memory footprint. The demonstration also presents a concrete use case by showing the different steps needed to integrate NeoEMF into an existing EMF-based application, and use it to store and query models containing several million elements. Finally, the demonstration presents an overview of two tools developed on top of NeoEMF: the Mogwaï query framework and ATL-MR, a distributed version of the ATL transformation engine.
Figure 1 describes the integration of NeoEMF in the EMF ecosystem. Modelers typically access a model using Model-based Tools, which provide high-level modeling features such as a graphical interface, interactive console, or query editor. These features internally rely on EMF’s Model Access API to navigate models, perform CRUD operations, check constraints, etc. At its core, EMF delegates the operations to a persistence manager using its Persistence API, which is in charge of the serialization/deserialization of the model. The NeoEMF core component is defined at this level, and can be registered as a persistence manager for EMF in the same way as, for example, the default XMI persistence manager. This design makes NeoEMF transparent both to the client application and to EMF itself, which simply delegates calls without taking care of the actual storage.
Fig. 1: NeoEMF Integration in EMF Ecosystem
Once the core component has received the modeling operation to perform, it forwards the operation to the appropriate database driver (Map, Graph, or Column), which is in charge of handling the low-level representation of the model. These connectors translate modeling operations into Backend API calls, store the results, and reify database records into EMF EObjects when needed. NeoEMF also embeds a set of default caching strategies that are used to improve the performance of client applications, and can be configured transparently at the EMF API level.
In addition to this transparent integration into existing EMF applications, NeoEMF provides a specific API that targets advanced users and high-performance applications. This API provides utility methods that overcome EMF limitations, allow fine-grained tuning of the databases, and give access to internal caches. By using this API, NeoEMF can be tuned to improve the execution time and/or scalability of a specific modeling scenario.
To provide this smooth integration into the EMF infrastructure, the NeoEMF core component redefines the behavior of several EMF classes. For instance, each NeoEMF driver defines a specific implementation of PersistenceBackendFactory that is responsible for creating the concrete data store. This factory creates an instance of the data store that corresponds to the Resource options. Once the data store has been created, the driver instantiates a specific implementation of the EStore interface (also depending on the Resource options) that translates the delegated method calls into datastore-specific API calls. This architecture makes it possible to change the underlying data store by simply updating the Resource options. The EStore also returns PersistentEObject instances from the database when needed, using a specific reification mechanism.
As introduced in the previous section, NeoEMF provides two API levels: one for standard use with existing EMF applications and APIs, and an advanced one that makes it possible to bypass EMF’s limitations, tune internal data stores, and configure caches. In this section we first present the standard features, available simply by plugging NeoEMF into an existing application, and then introduce its advanced features.
An important characteristic of NeoEMF is its compliance with the EMF API. All classes/interfaces extending existing EMF ones strictly define all their methods, and we paid special attention to ensuring that calling a NeoEMF method produces the same behavior (including possible side effects) as the standard EMF API call. As a result, existing applications can integrate NeoEMF with very little effort and benefit immediately from its scalability improvements. Existing code manipulating regular EMF EObjects does not have to be modified, and will behave as expected.
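The factory selection described above can be illustrated with a small, self-contained sketch. It does not use the EMF or NeoEMF classes: `SketchStore`, `SketchBackendFactory`, `SketchBackendRegistry`, and the option key `sketch.backend` are hypothetical names invented for this illustration, and the in-memory store only stands in for a real datastore.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a datastore-specific store (not the EMF EStore interface).
interface SketchStore {
    Object get(String objectId, String feature);
    void set(String objectId, String feature, Object value);
    String backendName();
}

// Hypothetical factory: one implementation per driver creates the concrete store.
interface SketchBackendFactory {
    SketchStore createStore(Map<String, Object> options);
}

// Trivial in-memory store so that the sketch stays runnable without any database.
final class InMemorySketchStore implements SketchStore {
    private final String name;
    private final Map<String, Object> data = new HashMap<>();

    InMemorySketchStore(String name) { this.name = name; }

    public Object get(String objectId, String feature) { return data.get(objectId + "#" + feature); }
    public void set(String objectId, String feature, Object value) { data.put(objectId + "#" + feature, value); }
    public String backendName() { return name; }
}

final class SketchBackendRegistry {
    // Assumed option key; the real option names are defined by NeoEMF and are not shown here.
    static final String BACKEND_OPTION = "sketch.backend";

    private final Map<String, SketchBackendFactory> factories = new HashMap<>();

    void register(String backend, SketchBackendFactory factory) { factories.put(backend, factory); }

    // Picks the factory matching the options, then lets it create the store.
    SketchStore open(Map<String, Object> options) {
        Object backend = options.get(BACKEND_OPTION);
        SketchBackendFactory factory = factories.get(backend);
        if (factory == null) {
            throw new IllegalArgumentException("No factory registered for backend: " + backend);
        }
        return factory.createStore(options);
    }

    static SketchBackendRegistry withDefaults() {
        SketchBackendRegistry registry = new SketchBackendRegistry();
        registry.register("map", options -> new InMemorySketchStore("map"));
        registry.register("graph", options -> new InMemorySketchStore("graph"));
        registry.register("column", options -> new InMemorySketchStore("column"));
        return registry;
    }
}
```

In this sketch, switching from one backend to another only means changing the value stored under the option key; the calling code keeps using the same store interface.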
Specifically, NeoEMF supports the following EMF features:
- Code generation: NeoEMF provides a dedicated code generator that transparently extends the EMF one, and allows client applications to manipulate models using generated Java classes.
- Reflective/Dynamic API: reflective and dynamic EMF methods (eSet, eGet, eUnset, eDynamicGet, eDynamicSet, …) can be used on NeoEMF objects, and behave like their standard implementations.
- Resource API: NeoEMF also implements the resource-specific API, such as the getContents, getAllContents, save, and load methods. In addition, NeoEMF takes advantage of the flexible save and load options to enable backend-specific customizations.
Like other model persistence solutions [5, 11], NeoEMF achieves scalability using a lazy-loading mechanism, which loads objects into memory only when they are accessed, overcoming XMI’s limitations. Lazy-loading is defined in the core component: the NeoEMF implementation of EObject is a simple wrapper that delegates all its method calls to an EStore, which directly manipulates elements at the database level. Using this technique, NeoEMF benefits from datastore optimizations (such as caches) and only keeps a small number of elements in memory (the ones that have not been saved), drastically reducing the memory consumption of modeling applications.
In addition to its compliance with the EMF API, NeoEMF provides specific utility features to tackle EMF’s limitations, such as the List<EObject> allInstances(EClass eClass) method, which is accessible through the PersistentResource interface. This feature tackles the problem of allInstances computation in EMF by delegating it to the data store, which can retrieve the requested elements quickly using data store indexes or a specific data representation.
NeoEMF also includes an io module providing a scalable Model Importer, which consists of an event-based XMI parser that bypasses the EMF API to efficiently store the model in a dedicated database with a low memory footprint. The importer is designed to be generic and can be implemented in each backend component. We also plan to add an efficient Model Exporter module that would produce optimized model serializations from their database representation.
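The delegation idea behind lazy-loading can be sketched without EMF. In the example below, `CountingStore` and `LazyElement` are hypothetical classes written for this illustration: the element keeps only its identifier, and every read goes to the store, which counts accesses so that the lazy behavior is observable.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical backing store; it counts reads so the lazy behavior can be observed.
final class CountingStore {
    private final Map<String, Object> records = new HashMap<>();
    private int reads = 0;

    void write(String id, String feature, Object value) { records.put(id + "#" + feature, value); }

    Object read(String id, String feature) {
        reads++;
        return records.get(id + "#" + feature);
    }

    int reads() { return reads; }
}

// Hypothetical lightweight element: it holds its identifier and a store reference, no feature values.
final class LazyElement {
    private final String id;
    private final CountingStore store;

    LazyElement(String id, CountingStore store) {
        this.id = id;
        this.store = store;
    }

    // Every access is delegated to the store; nothing is cached in the element itself.
    Object eGet(String feature) { return store.read(id, feature); }

    void eSet(String feature, Object value) { store.write(id, feature, value); }

    String id() { return id; }
}
```

Creating many such wrappers does not touch the store at all; a value is fetched only when a feature is actually read.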
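To make the benefit of delegating allInstances to the data store concrete, here is a minimal sketch comparing an index lookup with a full traversal. `TypeIndexedStore` is a hypothetical class, types are plain strings, and subtyping is ignored; it is not the NeoEMF implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TypeIndexedStore {
    // element id -> type name
    private final Map<String, String> typeOf = new HashMap<>();
    // type name -> element ids; this plays the role of a data store index
    private final Map<String, List<String>> byType = new HashMap<>();

    void add(String id, String type) {
        typeOf.put(id, type);
        byType.computeIfAbsent(type, k -> new ArrayList<>()).add(id);
    }

    // Index lookup: elements of other types are never visited.
    List<String> allInstances(String type) {
        return Collections.unmodifiableList(byType.getOrDefault(type, Collections.emptyList()));
    }

    // Baseline: visits every element, like iterating over the whole content of a model.
    List<String> allInstancesByTraversal(String type) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : typeOf.entrySet()) {
            if (entry.getValue().equals(type)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Both methods return the same elements; the difference is that the traversal cost grows with the size of the whole model, while the lookup only depends on the number of matching elements.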
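The principle of an event-based import can be sketched with the StAX API from the Java standard library. This is an assumption made for illustration: the text does not say which parser NeoEMF uses, `StreamingImportSketch` is a hypothetical class, and a plain map stands in for the database. The point is that elements are written to the target as they are encountered, without building an in-memory tree of the whole document.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StreamingImportSketch {

    // Reads the document event by event and stores the attributes of each element in `target`
    // under the key "<localName>@<position>". Returns the number of elements stored.
    static int importInto(String xml, Map<String, Map<String, String>> target) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            int count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    Map<String, String> attributes = new LinkedHashMap<>();
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        attributes.put(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
                    }
                    target.put(reader.getLocalName() + "@" + count, attributes);
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML input", e);
        }
    }
}
```

A real importer would also have to resolve cross-references and containment, which this sketch leaves out.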
Finally, NeoEMF contains a set of caching strategies that can be plugged on top of the data store according to specific needs. Note that these caches are available for all connectors, unless otherwise stated.
- EStructuralFeaturesCaching: an LRU cache storing loaded objects by their accessed feature.
- IsSetCaching: a cache keeping the result of isSet calls to avoid multiple accesses to the database.
- SizeCaching: a cache keeping the result of size calls on multi-valued features to avoid multiple accesses to the database.
- RecordCaches: a set of database-specific caches maintaining a list of records to improve execution time.
These caches can be configured through the save and load Resource methods, which accept specific options that are then forwarded to the appropriate PersistenceBackendFactory.
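As an illustration of the LRU strategy mentioned for feature caching, the sketch below builds a small least-recently-used cache on top of `java.util.LinkedHashMap` in access order. `LruFeatureCache` is a hypothetical class, the loader function stands in for a database access, and the actual NeoEMF caches may be implemented differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class LruFeatureCache<K, V> {
    private final Function<K, V> loader;   // stands in for a database access
    private final Map<K, V> entries;
    private int loads = 0;

    LruFeatureCache(int capacity, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true: iteration order goes from least to most recently accessed entry.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;   // evict the least recently used entry when full
            }
        };
    }

    // Returns the cached value, or loads and caches it on a miss.
    // Null results are not treated as cached and are loaded again on the next call.
    V get(K key) {
        V cached = entries.get(key);
        if (cached != null) {
            return cached;
        }
        loads++;
        V loaded = loader.apply(key);
        entries.put(key, loaded);
        return loaded;
    }

    // Number of times the loader was called, i.e. the number of cache misses.
    int loads() { return loads; }
}
```

The same decorator shape would fit the isSet and size caches: the first call reaches the database, and later calls for the same key are answered from memory until the entry is evicted.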
NoSQL Datastores in NeoEMF
The previous features are available for a variety of data stores supported by NeoEMF. In this section we introduce the different datastores available. We briefly describe how models are represented in these stores, how they differ, and the specific modeling scenario each one addresses best. Both the standard and the advanced features presented in the previous section are implemented in the supported datastores.
NeoEMF/Map has been designed to provide fast access to atomic operations, such as accessing a single element/attribute, and navigating a single reference. This implementation is optimized for EMF API-based accesses, which typically generate atomic and fragmented calls on the model. NeoEMF/Map embeds a key-value store, which maintains a set of in-memory/on-disk maps to speed up model element accesses. The benchmarks performed in previous work show that NeoEMF/Map is the most suitable solution to improve the performance and scalability of EMF API-based tools that need to access very large models on a single machine.
The NeoEMF/Map data model is composed of three different maps that store model information: (i) a property map, which keeps all object data in a centralized place; (ii) a type map, which tracks how objects relate to the meta-level (such as the instance-of relationships); and (iii) a containment map, which defines the model structure in terms of containment references.
NeoEMF/Graph persists models in an embedded graph database that represents model elements as vertices, attributes as vertex properties, and references as edges. Metamodel elements are also persisted as vertices in the graph, and are linked to their instances through the INSTANCE_OF relationship.
Using graphs to store models allows NeoEMF to benefit from the rich traversal features that graph databases usually provide, such as fast shortest-path computation, or efficient complex navigation paths among several vertices/edges. These advanced query capabilities have been used to develop the Mogwaï tool, which maps OCL expressions to graph navigation traversals. On the other hand, graph databases are not well-suited to atomic accesses to single elements or attributes, which are typical of interactive model editing.
NeoEMF/Column has been designed to enable the development of distributed MDE-based applications by relying on a distributed column-based datastore. NeoEMF/Column uses a single table with three column families to store model information: (i) a property column family that keeps all object data stored together; (ii) a type column family that tracks how objects relate to the meta-level (such as the instance-of relationships); and (iii) a containment column family that defines the model structure in terms of containment hierarchy.
In contrast with the Map and Graph implementations, NeoEMF/Column offers concurrent read/write capabilities and guarantees ACID properties at model element level. It exploits the wide availability of distributed clusters in order to distribute intensive read/write workloads across datanodes. The distributed nature of this persistence solution is used in the ATL-MR tool, a distributed engine for model transformations in the ATL language on top of MapReduce. NeoEMF/Column enables the cluster’s nodes to share read/write rights over the same set of input/output models.
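The three-map layout can be pictured with the following sketch, which uses plain in-memory maps. `ThreeMapModelSketch` is a hypothetical class, and the key encoding (`id#feature`) and the shape of the containment entries are assumptions made for illustration, not the actual NeoEMF/Map format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ThreeMapModelSketch {
    // (i) property map: "objectId#featureName" -> value
    private final Map<String, Object> properties = new HashMap<>();
    // (ii) type map: objectId -> metaclass name
    private final Map<String, String> types = new HashMap<>();
    // (iii) containment map: objectId -> container objectId (absent for root objects)
    private final Map<String, String> containers = new HashMap<>();

    void create(String id, String metaclass, String containerId) {
        types.put(id, metaclass);
        if (containerId != null) {
            containers.put(id, containerId);
        }
    }

    // Atomic accesses are single map operations.
    void setProperty(String id, String feature, Object value) { properties.put(id + "#" + feature, value); }

    Object getProperty(String id, String feature) { return properties.get(id + "#" + feature); }

    String typeOf(String id) { return types.get(id); }

    String containerOf(String id) { return containers.get(id); }

    // Follows containment entries from an element up to its root.
    List<String> pathToRoot(String id) {
        List<String> path = new ArrayList<>();
        for (String current = id; current != null; current = containers.get(current)) {
            path.add(current);
        }
        return path;
    }
}
```

Each atomic operation maps to one lookup in one of the maps, which is the kind of access pattern the text says this backend is optimized for.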
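The graph representation and a simple traversal can be sketched in plain Java. `ModelGraphSketch` is a hypothetical in-memory structure, not the Blueprints API: elements are vertices with properties, references are labeled edges, and a breadth-first search over reference edges stands in for the shortest-path features a graph database would provide natively.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

final class ModelGraphSketch {
    static final String INSTANCE_OF = "INSTANCE_OF";

    private static final class Edge {
        final String label;
        final String target;
        Edge(String label, String target) { this.label = label; this.target = target; }
    }

    private final Map<String, Map<String, Object>> vertexProperties = new HashMap<>();
    private final Map<String, List<Edge>> outgoing = new HashMap<>();

    void addVertex(String id) { vertexProperties.putIfAbsent(id, new HashMap<>()); }

    // Attributes are stored as vertex properties.
    void setAttribute(String id, String name, Object value) {
        vertexProperties.computeIfAbsent(id, k -> new HashMap<>()).put(name, value);
    }

    // References (and INSTANCE_OF links to metamodel vertices) are stored as labeled edges.
    void addEdge(String from, String label, String to) {
        addVertex(from);
        addVertex(to);
        outgoing.computeIfAbsent(from, k -> new ArrayList<>()).add(new Edge(label, to));
    }

    String metaclassOf(String id) {
        for (Edge edge : outgoing.getOrDefault(id, Collections.emptyList())) {
            if (edge.label.equals(INSTANCE_OF)) return edge.target;
        }
        return null;
    }

    // Breadth-first search over reference edges; returns an empty list if `to` is unreachable.
    List<String> shortestPath(String from, String to) {
        Map<String, String> parent = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        parent.put(from, null);
        queue.add(from);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(to)) {
                LinkedList<String> path = new LinkedList<>();
                for (String v = current; v != null; v = parent.get(v)) path.addFirst(v);
                return path;
            }
            for (Edge edge : outgoing.getOrDefault(current, Collections.emptyList())) {
                if (edge.label.equals(INSTANCE_OF)) continue;   // stay at the model level
                if (!parent.containsKey(edge.target)) {
                    parent.put(edge.target, current);
                    queue.add(edge.target);
                }
            }
        }
        return Collections.emptyList();
    }
}
```

Note how reading a single attribute still requires locating the vertex first, while multi-step navigation is a natural fit for this structure; this mirrors the trade-off described above.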
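The single-table layout with three column families can be pictured as nested sorted maps, in the spirit of a wide-column store. This is only a local sketch: `ColumnTableSketch` is a hypothetical class, the qualifiers used in the usage below (`name`, `metaclass`, `container`) are assumptions, and nothing here is distributed or uses the HBase client API.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

final class ColumnTableSketch {
    static final String PROPERTY = "property";
    static final String TYPE = "type";
    static final String CONTAINMENT = "containment";

    // row key (object id) -> column family -> qualifier -> value; rows are kept sorted by key.
    private final TreeMap<String, Map<String, Map<String, String>>> rows = new TreeMap<>();

    void put(String rowKey, String family, String qualifier, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .put(qualifier, value);
    }

    // Reads one cell; returns null if the row, family, or qualifier is absent.
    String get(String rowKey, String family, String qualifier) {
        return rows.getOrDefault(rowKey, Collections.emptyMap())
                   .getOrDefault(family, Collections.emptyMap())
                   .get(qualifier);
    }

    // Returns all cells of one column family for a row, which are stored together.
    Map<String, String> family(String rowKey, String family) {
        return Collections.unmodifiableMap(
                rows.getOrDefault(rowKey, Collections.emptyMap())
                    .getOrDefault(family, Collections.emptyMap()));
    }
}
```

For example, an element `c1` could be stored with `put("c1", PROPERTY, "name", "Customer")`, `put("c1", TYPE, "metaclass", "Class")`, and `put("c1", CONTAINMENT, "container", "p1")`, so that all information about one element lives in a single row.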
Implementation of NeoEMF
NeoEMF has been implemented as a set of open source Eclipse plugins distributed under the EPL license. The NeoEMF website presents an overview of the key features and current ongoing work, and the source code repository is fully available on GitHub. NeoEMF has been released as part of the MONDO platform.
The NeoEMF/Graph implementation relies on Blueprints, an interface designed to unify graph databases under a common API. Blueprints has been implemented by a large number of databases, such as Neo4j, OrientDB, and Titan. The use of this abstraction layer on top of graph databases enables client applications to use the graph implementation of their choice, as long as it implements the Blueprints API. For now, NeoEMF/Graph embeds Blueprints 2.5.0 and provides a convenience wrapper for Neo4j 1.9.6. An implementation relying on the new Blueprints API (called Tinkerpop3) is currently under study, as well as the creation of additional database wrappers.
NeoEMF/Map embeds the key-value store MapDB 1.0.9. MapDB provides Maps, Sets, Lists, Queues and other collections backed by off-heap or on-disk storage, and describes itself as a hybrid between Java Collections and an embedded database engine. It provides advanced features such as ACID transactions, snapshots, and incremental backups. NeoEMF/Map relies on the set of Maps provided by MapDB and uses them as a key-value store.
NeoEMF/Column is built on top of Apache HBase 0.98.13-hadoop2, a non-relational wide column database providing distributed data storage on top of HDFS. It is able to host very large tables (billions of rows containing millions of columns) atop clusters of commodity hardware. Model distribution is hidden from client applications, which access the elements transparently using the standard EMF API.
Finally, NeoEMF provides a model browser for exploring and editing large models in an EMF-like editor. It relies on the lazy-loading feature of the framework, which loads into memory only the elements that are displayed, limiting memory consumption. The NeoEMF browser can be used to open any NeoEMF resource, regardless of the backend used to store the model. Figure 2 shows a snapshot of the browser, where a large model (jdt-core.graph) containing two million elements is displayed and can be manipulated.
Fig. 2: NeoEMF Lazy Editor
In this article we have presented NeoEMF, a multi-datastore model persistence framework. It relies on a lazy-loading capability that makes it possible to navigate very large models with a reduced amount of memory, by loading elements only when they are accessed. NeoEMF provides three implementations that can be plugged in transparently to provide an optimized solution to different modeling use cases: atomic accesses through interactive editing, complex query computation, and cloud-based model transformation.
[1] Apache. Apache HBase, 2016. URL: https://hbase.apache.org/.
[2] Amine Benelallam, Abel Gómez, Gerson Sunyé, Massimo Tisi, and David Launay. Neo4EMF, a Scalable Persistence Layer for EMF Models. In Proc. of the 10th ECMFA, pages 230–241, 2014.
[3] Amine Benelallam, Abel Gómez, Massimo Tisi, and Jordi Cabot. Distributed Model-to-Model Transformation with ATL on MapReduce. In Proc. of the 8th SLE Conference, pages 37–48. ACM, 2015.
[4] Gwendal Daniel, Gerson Sunyé, and Jordi Cabot. Mogwaï: a Framework to Handle Complex Queries on Large Models. In Proc. of the 10th RCIS Conference (to appear). IEEE, 2016. Available online at http://tinyurl.com/jgopmvk.
[5] Eclipse Foundation. The CDO Model Repository (CDO), 2016. URL: http://www.eclipse.org/cdo/.
[6] Abel Gómez, Amine Benelallam, and Massimo Tisi. Decentralized Model Persistence for Distributed Computing. In Proc. of the 3rd BigMDE Workshop, pages 42–51. CEUR-WS.org, 2015.
[7] Abel Gómez, Gerson Sunyé, Massimo Tisi, and Jordi Cabot. Map-based Transparent Persistence for Very Large Models. In Proc. of the 18th FASE Conference, pages 19–34. Springer, 2015.
[8] Dimitrios S. Kolovos, Louis M. Rose, Richard F. Paige, Esther Guerra, Jesús Sánchez Cuadrado, Juan de Lara, István Ráth, Dániel Varró, Gerson Sunyé, and Massimo Tisi. MONDO: Scalable Modelling and Model Management on the Cloud. In Proc. of the Projects Showcase (STAF 2015), pages 44–53, 2015.
[9] MapDB. MapDB, 2016. URL: www.mapdb.org.
[10] Parastoo Mohagheghi, Miguel A. Fernandez, Juan A. Martell, Mathias Fritzsche, and Wasif Gilani. MDE Adoption in Industry: Challenges and Success Criteria. In Models in Software Engineering, pages 54–59. Springer, 2009.
[11] Javier Espinazo Pagán and Jesús García Molina. Querying Large Models Efficiently. IST, 2014.
[12] Tinkerpop. Blueprints API, 2016. URL: blueprints.tinkerpop.com.
[13] J. B. Warmer and A. G. Kleppe. Building a Flexible Software Factory using Partial Domain Specific Models. In Proc. of the 6th OOPSLA DSM Workshop. University of Jyvaskyla, 2006.
[14] Ran Wei and Dimitrios S. Kolovos. An Efficient Computation Strategy for allInstances(). In Proc. of the 3rd BigMDE Workshop, pages 32–42. CEUR-WS.org, 2015.