# Lightweight String Reasoning for OCL

Most domain models include a number of constraints which frequently deal with text data (e.g. checking the right format of a telephone field or checking the uniqueness of a street+city name). As all constraints, we may make mistakes when defining them and end up with a model that is completely useless (since users won’t be able to create valid instances for it).

To solve this problem, Fabian Büttner summarizes our work on enabling (lightweight) string reasoning for OCL constraints. Enter Fabian.

In this post, we present an extension to the recently introduced EMFtoCSP model finder. We will present this extension at the ECMFA’12 conference. The EMFtoCSP tool aids modelers and developers in various validation and verification tasks: It takes a model (UML or Ecore), a set of OCL well-formedness rules, and a set of search bounds. It returns either a valid (“satisfying”) instance of the input model, or – if there is no such instance – returns “unsatisfiable”.

Because the model finding problem is computationally hard in general, the actual heavy lifting is performed by a sophisticated underlying solver, the ECLiPSe constraint logic programming environment. Other approaches follow similar ways and employ, for example, SAT or SMT solvers.

A common weak spot in all current solutions that we are aware of is, however, the treatment of OCL constraints that involve the UML/OCL data type String. The tools either do not support string constraints at all, or they encode strings a fixed-length vectors of variables. Obviously, a fixed-length encoding is suited only for reasonable small maximum string lengths, but in several cases, we are interested in larger maximum lengths, even if only a small fraction of the string values in our instances actually reaches that limit. This point is addressed by our extension.

As we consider the solving of string constraints only a sub-problem of the overall model finding process, we have been searching for an inexpensive, light-weight approach to integrate string constraints into EMFtoCSP (but still providing some symbolic reasoning). We did not incorporate one of the “full grown” formal methods that are available for reasoning about regular languages. Our core idea is to treat strings differently in the individual stages of the search process:

- As long as possible, we treat string variable (e.g., attribute values) as symbolic variables. At this stage, we mainly reason about the possible lengths of our string values. Furthermore, we process certain constraints symbolically, when this is can be done without too much overhead. For example, from s1.indexOf(s2 + s3) > 0 and s1.size = N, we infer s2.length + s3.length = N. Also, we would, for example, detect that s1 = s2 and s1 <> s2 are unsatisfiable. We employ Constraint Handling Rules to implement the reasoning rules.
- After processing the lightweight string constraints (and all other OCL constraints), we have a candidate instance of our model. Only at this point, we finally turn the symbolic string variables into lists of element variables, and the remaining constraint fragments into constraints on these variables. Here we can see the difference to a fixed length encoding: If we have 1000 strings of length 2 and one string of length 1000, we require 3000 element variables. A fixed length encoding would require 1.000.000 variables!

This economic approach works fine for many models and lightweight strings constraints (for “hard string puzzles”, bit-blasting techniques are better suited). With our prototypical implementation, we could identify instances with thousands of strings in short time. While we used ECLiPSe to implement this reasoning scheme, we expect that it can be easily incorporated into similar model finders as well.

A simple illustration of the limits of a fixed-width encoding of strings: This Alloy specification models ten objects, each having string of up to 127 elements (8 bits).

sig S { a_str : seq Int }

run {} for exactly 100 S, 127 seq, 8 int

Finding a solution for this specification (without even adding any further facts) takes three minutes on my computer. For bigger numbers it gets much worse. The reason is the construction of the CNF formula, which consists of more the 3.5 million clauses and more than 2 million variables. With our approach, Strings of such length can be easily handled (assuming that the constraints that we pose on them are not too hard).