I was invited to participate in the panel Big Data and Conceptual Models: Are they mutually compatible?, part of the ER 2016 conference (where we were already presenting our UMLtoGraphDB, metaScience, and human factors in MDE works).
The panel was organized as a highly interactive session where panelists responded to questions from the organizer (Sudha Ram) and the audience, but without an initial position statement. Therefore, for the record, I’m going to use this post to make such a statement 🙂 . So, here we go, my key messages regarding the relationship between (conceptual) models and the world of big data:
- Big data is not schemaless. At most, we can say it is “less-schema” than other data
- When accessing the data, you need a schema that helps you interpret that data. If there is no explicit model to use, you have to infer one (e.g. using JSON Discoverer)
- For big data, models are not static, fixed, and complete artifacts, but rather partial, dynamic, and temporal views of the data that facilitate manipulating it at a specific instant.
- In traditional software development, we follow a “model-down” approach. For big data, we have to switch to a “data-up” one, i.e. the data drives the models we use, instead of the models defining what data we can have in the system
- Uncertainty becomes a first-class citizen: we may not be sure about the schema to use, about the quality of the data, about the reliability of the source, and so on. Every interpretation comes with some probability of being correct
- Big data is more and more linked to APIs, since plenty of data is being released behind some kind of web API instead of using linked data / semantic web technologies. This is especially true for open data
- The user of big data is more and more a non-technical end user (the so-called “citizen developer”). Modeling approaches for big data need to keep this profile in mind. This is for instance one of the goals of our funded “Open data for All” project
- Models can also play a key role in achieving interoperability between different data sources
- Models of big data don’t need to be unique; they can even be personal models, describing only the specific parts of the data that a given user wants to explore.
- Temporal and spatial properties are key elements in big data. Most modeling languages are not good at representing spatial or temporal information.
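To make the “data-up” and schema-inference points concrete, here is a minimal sketch (plain Python, not the actual JSON Discoverer implementation; all names are illustrative) of deriving a partial schema from a batch of JSON records — the inferred schema is exactly the kind of partial, temporal view described above, valid only for the data seen so far:

```python
import json


def infer_schema(records):
    """Infer a field -> set-of-type-names mapping from JSON objects.

    The result is a partial view of the data: only fields seen so far,
    with every type each field has taken (disagreements included).
    """
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Two records that disagree on the type of "id" -- a common big-data reality.
docs = [json.loads(s) for s in (
    '{"id": 1, "name": "Ada"}',
    '{"id": "2", "name": "Grace", "active": true}',
)]

schema = infer_schema(docs)
# schema["id"] contains both "int" and "str": uncertainty made explicit
```

Note that the inferred schema grows and changes as more records arrive, which is precisely why it is a temporal view rather than a fixed artifact.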
As always, happy to hear your opinion and listen to your disagreements!
ICREA Research Professor at Internet Interdisciplinary Institute (UOC). Leader of the SOM Research Lab focusing on the broad area of systems and software engineering. Home page.
On one hand, big data sounds like “drinking from the fire hose”. On the other, it sounds like a cop-out. Coming from a methodology where one of the major proponents has stated, “Uncertainty is not an option!”, big data sounds like a good methodology to ignore. I think
I’m really confused by the use of the terms temporal and spatial, and the statement, “Most modeling languages are not good at representing spatial or temporal information.” There are many ways to deal with time aspects in modeling. Spatial can be logical-structural or physical containment; I’m not sure which is referenced.
Primitives for easily representing time and geospatial concepts are typically missing from modeling languages. E.g. OCL does not even have a date type. And of course, it’s not possible to represent temporal constraints (in the sense of what you could do with linear temporal logic). Sure, you can extend the language, and there are many proposals for that (e.g. my own: https://www.researchgate.net/publication/225173022_Representing_Temporal_Information_in_UML ), but they are just proposals; none made it to the standard. So, yes, you can find a workaround, but it’s not easy
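To illustrate the kind of temporal constraint meant above, here is a small sketch (Python, purely illustrative — not OCL and not any standard proposal) of two classic linear-temporal-logic operators evaluated over a finite trace of states:

```python
def eventually(trace, predicate):
    """LTL 'F p' over a finite trace: p holds in at least one state."""
    return any(predicate(state) for state in trace)


def always(trace, predicate):
    """LTL 'G p' over a finite trace: p holds in every state."""
    return all(predicate(state) for state in trace)


# A trace of an order's states over time. The constraints "every order
# is eventually shipped" and "an order is never cancelled" are exactly
# what plain OCL invariants (one state at a time) cannot express.
trace = [{"status": "created"}, {"status": "paid"}, {"status": "shipped"}]

shipped_eventually = eventually(trace, lambda s: s["status"] == "shipped")
never_cancelled = always(trace, lambda s: s["status"] != "cancelled")
```

This is of course only the finite-trace fragment; the point is that such constraints quantify over a history of states, not a single snapshot.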
Shlaer-Mellor modeling has had time and date as core types for as long as I’ve been using it. Geospatial types have always been handled as user-defined types, either modeled as classes for manipulation in the problem space, or as composite (opaque) data for manipulation outside the domain.
Temporal constraints as shown in the linked proposal have always been handled by a combination of domains (subject matters), because, as the proposal notes, the temporal aspects can clutter the subject matter being presented.
Associative classes are the mechanism for temporality in the modeled domain, but the persisting is done in an external domain. The past or future usage is often shown in a model by how the association and associative classes are named. Employee*–*Project could have an Employee Assignment associative class for temporal external persisting. The usage for past or future can be indicated with verb phrases.
The past usage is Employee ‘is assignment for’ *–* ‘is working on’ Project, which reads as “Employee may be working on one or more Projects” and “Project is assignment for zero to many Employees”, with an Employee Assignment for each Employee-instance-to-Project-instance relationship.
The future usage is Employee ‘has scheduled’ *–* ‘is scheduled for’ Project, which reads as “Employee may be scheduled for one or more Projects” and “Project has scheduled zero to many Employees”, with an Employee Assignment for each Employee-instance-to-Project-instance relationship.
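The Employee–Project association above can be sketched in code (Python here, purely illustrative — Shlaer-Mellor models are not Python, and all names come from the example) as an associative class carrying the temporal information, with the dates distinguishing past from future usage:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Employee:
    name: str


@dataclass
class Project:
    title: str


@dataclass
class EmployeeAssignment:
    """Associative class linking Employee and Project.

    The dates make the temporal usage explicit: a past start with no end
    reads as 'is working on'; a future start reads as 'is scheduled for'.
    """
    employee: Employee
    project: Project
    start: date
    end: Optional[date] = None  # None = open-ended assignment


ada = Employee("Ada")
migration = Project("Database migration")

# One associative instance per Employee-instance-to-Project-instance link.
assignment = EmployeeAssignment(ada, migration, start=date(2016, 1, 4))
```

Persisting these assignment instances would live in a separate external domain, as the comment above describes.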
Probably an oversimplified example, but the real takeaway is that UML is not a universal language; with proper separation of concerns, it doesn’t need to be, but there are also many necessary things not specified and/or implemented yet. The pursuit of those things is never a waste of time.