to define a database schema is similar to the distinction between static and dynamic typing in programming languages.

Schemaless databases offer some advantages that can be very useful in scenarios where the data structure changes frequently [1]. For example, they make it easy to have custom fields and non-uniform types for database entities, and data with a new structure can be added at any moment because no schema imposes restrictions on it. However, this flexibility should not be obtained at the expense of losing the benefits provided by schemas.

Developers need to keep the implicit schema in mind when they write (or read) the code of applications that manage NoSQL databases. Also, database tools usually require knowledge of a schema to implement their functionality.

Therefore, NoSQL schema extraction is receiving increasing attention from industry and academia, as discussed in [2]. The report “Insights into NoSQL Modeling” (Dataversity, 2015) [3] highlighted that data modeling will be a crucial activity for NoSQL databases and drew attention to the need for NoSQL tools to provide functionality similar to that available for relational databases. In particular, three main types of desired functionality were identified from the survey conducted with data management experts: diagramming, code generation, and metadata management. The report also remarked that schema discovery would be a common task needed to achieve such functionality.

Schemas for NoSQL Databases

“NoSQL databases” is actually used to denote a varied set of database modeling paradigms that are usually grouped into four main types: document, wide-column, key-value, and graph-based databases. The first three types are categorized as “aggregation-oriented paradigms” because object aggregation prevails over connections between objects (i.e. references). More details on this classification can be found in [5].

The notion of schema is well-defined for relational databases. However, NoSQL databases can store several versions or variations of a particular entity. For example, a movie database could have movie and director objects with different structures. Next, we show a movie database example that includes three versions of movie objects and three versions of director objects. We will use this example to illustrate the schema visualization.

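As a rough illustration of what "several versions of an entity" means in practice, the following Python sketch groups objects by their structure (field names plus value types). It is only a sketch under stated assumptions: the sample movies, the field values, and the helper names `structure_of` and `group_by_structure` are hypothetical and are not taken from the example database used in this article.

```python
from collections import defaultdict


def structure_of(obj):
    """Return a hashable description of an object's field names and value types."""
    if isinstance(obj, dict):
        # Sort by field name so that key order does not affect the result.
        return tuple(sorted((key, structure_of(value)) for key, value in obj.items()))
    if isinstance(obj, list):
        # A list is described by the distinct structures of its elements.
        element_structures = {structure_of(item) for item in obj}
        return ("list", tuple(sorted(element_structures, key=repr)))
    # Atomic value: keep only the name of its Python type.
    return type(obj).__name__


def group_by_structure(objects):
    """Group objects that share exactly the same structure."""
    groups = defaultdict(list)
    for obj in objects:
        groups[structure_of(obj)].append(obj)
    return dict(groups)


# Hypothetical sample data (not the article's example database).
movies = [
    {"title": "Citizen Kane", "year": 1941, "director_id": "d1"},
    {"title": "Vertigo", "year": 1958, "director_id": "d2"},
    {"title": "Psycho", "year": 1960, "genre": "Thriller", "director_id": "d2"},
    {"title": "Alien", "year": "1979", "director_id": "d3"},  # year stored as text
]

variations = group_by_structure(movies)
print(len(variations))  # number of distinct structures found among the movies
```

In this sample, an extra `genre` field and a `year` stored as text each produce a separate structural variation, while the first two movies share one structure. A real extraction process would also have to handle references and embedded entities, which this sketch ignores.
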
Taking into account that data of the same entity can be stored with different structures (i.e. non-uniform types), we have considered several notions of schema for NoSQL databases:

- Object schema (or object type): it is obtained by recursively replacing the atomic values of a semi-structured object (JSON in our case) with an identifier that denotes their type (e.g. String, Number). The schema extraction process analyzes the set of object schemas to discover the set of entities and the relationships between them.
- Entity version schema (or simply version schema): it is obtained from the object schema of an entity version by replacing each embedded or referenced object with the name of the corresponding embedded or target entity version, respectively. These schemas can describe both root objects (root version schema) and embedded objects (embedded version schema). Next, we show the root version schema for the movie object with _id="1" (Movie_1; each version is named with the entity name followed by the id number).

{
  "title": "String",
  "year": "Number",
  "genre": "String",
  "director_id": "ref(Director)",
  "prizes": "Prize_1",
  "criticisms": ["Criticism_1", "Criticism_2"]
}

- Entity schema: the set of version schemas of a given entity.
- Entity union schema: it is a view of all the version schemas of an entity. It can be obtained by joining all the properties contained in the version schemas and applying some rules to resolve name conflicts. We have applied the following rule: when a property name appears in more than one version schema and its type differs among them, the union type is used. The union schema for the two movie entities of the movie database example would be the following:

{
  "title":