With the goal of identifying and extracting text strings from images in scanned documents, pictures of handwritten text, video frames, etc., Optical Character Recognition (OCR) systems were born years ago, and they remain a very active research area [1] that benefits from advances in computer vision, neural networks, machine translation, etc.

Current efforts focus mostly on mainstream natural languages, for which there is ample data available for training, plus predefined language models and public dictionaries that help achieve high levels of accuracy in the OCR process. In contrast, specific OCR support decreases considerably when targeting programming languages – although some works to extract source code from programming video tutorials have appeared lately [2,3,4] – and even more when addressing Domain-Specific Languages (DSLs), where, due to their own nature and unlike general-purpose languages (GPLs), we do not have predefined dictionaries or pretrained recognition algorithms available. For instance, unlike for GPLs, we cannot assume that code repositories like GitHub have enough good examples of any DSL we can think of to train an OCR model for the language from scratch.

Nevertheless, good OCR support for DSLs could bring significant benefits and open the door to interesting applications in the field of DSLs. For instance, one could parse old manuals of legacy DSLs (or even proceedings from past editions of SLE or related conferences) to automatically extract examples, which could later be used as test data for new parsers or to train machine learning-based algorithms. In these cases, numerous examples are needed, and common solutions such as the generation of synthetic data [5] may not be optimal; when possible, OCR support for DSLs would provide an additional source of data. From a teaching perspective, such OCR support could also help in processing student assignments for automatic assessment.

Additionally, DSLs are nowadays also documented by means of video tutorials, as is the case for general programming languages. Furthermore, there is the specific case of graphical DSLs, whose graphical notation is complemented with textual languages. In those cases, the textual DSL expressions often appear as annotations next to the referenced graphical elements, and the annotated diagrams are usually stored, published, or shared as images. OCL [6] is an example of a textual language that complements UML [7]. While complex OCL expressions are better defined in separate files, short ones are usually depicted as notes in UML class diagrams (or other UML diagrams).

In the same way that Language Workbenches [8] offer features such as parser generation and autocompletion from a DSL definition, we believe that Language Workbenches should also come with OCR support. For instance, they could automatically generate a tailor-made OCR configuration or post-processing step for any given language.

In this blog post, we summarize our vision of the main challenges of OCR support for textual DSLs and discuss several alternatives to improve recognition quality by leveraging the DSL specification and available domain data. This work was presented at the International Conference on Software Language Engineering (SLE 2020) and is co-authored by Jorge Perianez-Pascual, Roberto Rodriguez-Echeverria, Loli Burgueño, and Jordi Cabot. You can read the full paper [10] or the summary below. See also the video of our presentation.

Challenges in OCR support for DSLs

Compared to the recognition of natural language (NL) text, DSL snippet recognition presents additional challenges.

  • First, since DSL snippets need to be eventually processed by a language IDE, error-free recognition is essential. For instance, missing a “.” character at the end of an English sentence neither prevents the reader from understanding its meaning nor makes a text processor fail, but missing a “.” in a piece of code will prevent the language parser from building the expression’s abstract syntax tree (AST).
  • Second, when the recognition of a sequence of characters that forms a word has low accuracy, pre-trained OCR engines try to return the closest word in their dictionary. Since pre-trained OCR engines have been trained on NL, whose vocabulary is closed (e.g., the English dictionary), the quality of their results decreases when they are used to recognize DSL expressions, as each DSL has its own specific lexicon and grammar.
  • Furthermore, punctuation signs are used in a very different way in a DSL compared to NL (e.g., “.” characters are used as separators between two words in addition to whitespace, hence OCR models for NL tend to insert whitespace after each recognized “.”). For this reason, and as confirmed by our own experience, most recognized DSL snippets present some kind of syntax error, which prevents them from being properly loaded into language IDEs. Assuming the original DSL snippet was correct, the syntax errors introduced by an OCR can be reduced to the following two types: (1) symbol not found, e.g., the OCR changed, merged, or split one of the symbols; and (2) punctuation sign or operator missing or appearing in an unexpected position, e.g., a missed “)” or an unexpected “””. A minimal repair sketch for the whitespace case is shown right after this list.
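As a simple illustration of error type (2), the spurious whitespace that NL-trained OCR engines insert around “.” separators can often be undone with a lightweight textual repair. Below is a minimal sketch of that idea in Python; the function name and regular expressions are ours, not the implementation evaluated in the paper:

```python
import re

def repair_dot_spacing(text: str) -> str:
    """Remove the spurious whitespace that NL-trained OCR engines
    tend to insert around '.' when it acts as a navigation operator
    (e.g., 'self. name' -> 'self.name')."""
    # Drop whitespace after a '.' that sits between two identifiers.
    text = re.sub(r'(\w)\.\s+(\w)', r'\1.\2', text)
    # Drop whitespace before such a '.' as well (e.g., 'self .name').
    text = re.sub(r'(\w)\s+\.(\w)', r'\1.\2', text)
    return text

print(repair_dot_spacing("self. employees->forAll(e | e. age >= 18)"))
# -> self.employees->forAll(e | e.age >= 18)
```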

How to improve OCR recognition for DSLs

Although the easiest way to address these issues might be to (re)train an OCR engine with images of DSL code, the amount of data needed is usually much larger than the data available. For instance, Tesseract, one of the most popular OCR engines, has been trained with 400,000–800,000 rendered lines of NL text. While this amount of data could be collected from GitHub repositories for well-known GPLs such as Java or Python, we believe this is hardly achievable (if not impossible) for most DSLs.

For DSL recognition, we propose an alternative method: using as much knowledge as possible from the DSL specification itself to drive the recognition process. We distinguish five fundamental groups of elements in any DSL excerpt:

  1. DSL punctuation (punctuation signs and operators)
  2. DSL lexicon (i.e., keywords, functions, types)
  3. Domain names (when applicable). Note that we use the term domain names to refer to different information sources that may appear in different application domains, e.g., the names of a database schema for SQL, the names of a metamodel for OCL, or the names of an API definition for a programming language.
  4. User-defined names (e.g., variables)
  5. Literal values (e.g., ‘John Smith’)

While the first three groups depend on the DSL and can be captured in a dictionary (i.e., they can be derived from the language grammar and additional existing artifacts), the last two contain information that may vary from snippet to snippet.

The definition of the DSL can help in optimizing the recognition of the first three groups above.
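To give an intuition of how this dictionary-driven recognition could work, here is a minimal post-correction sketch in Python (ours, not the implementation evaluated in the paper). It snaps misrecognized tokens back to the DSL lexicon and domain names using edit-distance matching; the OCL keywords, domain names, and similarity cutoff below are illustrative choices:

```python
import difflib

# Groups 1-3: derivable from the DSL specification and domain artifacts.
OCL_PUNCTUATION = {".", "->", "(", ")", "|", ":", "=", ">=", "<="}
OCL_LEXICON = {"context", "inv", "self", "forAll", "exists", "select",
               "collect", "size", "implies", "and", "or", "not"}
DOMAIN_NAMES = {"Employee", "employees", "age", "name"}  # e.g., from a metamodel

DICTIONARY = sorted(OCL_LEXICON | DOMAIN_NAMES)

def snap_token(token: str, cutoff: float = 0.6) -> str:
    """Replace a recognized token with the closest dictionary entry,
    but only if the match is close enough; user-defined names and
    literal values (groups 4-5) should pass through untouched."""
    if token in OCL_PUNCTUATION or token in DICTIONARY:
        return token
    matches = difflib.get_close_matches(token, DICTIONARY, n=1, cutoff=cutoff)
    return matches[0] if matches else token

# Example: the OCR misread 'self' as 'se1f'.
tokens = ["context", "Employee", "inv", ":", "se1f", ".", "age", ">=", "18"]
print(" ".join(snap_token(t) for t in tokens))
# -> context Employee inv : se lf . age >= 18 becomes
#    context Employee inv : self . age >= 18 (loosely re-joined for the demo)
```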

Finally, an additional challenge appears when proposing any method for OCR quality improvement: OCR independence. Any method should be seamlessly applicable to different OCR engines without introducing any overhead.
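As a sketch of what OCR independence could look like in practice, the repair strategies can be applied purely as post-processing on the recognized string, behind a minimal engine interface. The class and function names below are ours, and pytesseract is just one possible backend:

```python
import re
from typing import Protocol

class OcrEngine(Protocol):
    """Anything that can turn an image file into text."""
    def recognize(self, image_path: str) -> str: ...

class TesseractEngine:
    """One concrete backend; any engine exposing recognize() can be plugged in."""
    def recognize(self, image_path: str) -> str:
        import pytesseract
        from PIL import Image
        return pytesseract.image_to_string(Image.open(image_path))

def repair(text: str) -> str:
    """DSL-aware textual repairs on the raw OCR output (here, just
    the '.' whitespace fix from the earlier sketch)."""
    return re.sub(r'(\w)\s*\.\s*(\w)', r'\1.\2', text)

def recognize_dsl(engine: OcrEngine, image_path: str) -> str:
    """Engine-agnostic pipeline: raw recognition, then textual repair.
    Swapping OCR engines does not touch the repair logic at all."""
    return repair(engine.recognize(image_path))
```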

Empirical Study

To get a first evaluation of the aforementioned ideas about OCR support for DSLs, we conducted an empirical study with OCL as a particular case of DSL, and with Tesseract, one of the most popular and accurate OCR engines, broadly used for programming language transcription.

We applied four different strategies, based on the ideas above, to improve the OCR recognition of OCL expressions. To do so, we took the dataset from [9], which contains 4,774 OCL expressions from 504 EMF metamodels coming from 245 systematically selected GitHub repositories. We removed faulty or invalid metamodels and OCL expressions, and for each valid OCL expression we generated 10 different images using 10 different fonts (simulating both computer and hand-written font types). For more details about our experimental setup, please check our paper [10].
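For illustration, each expression-to-image rendering step can be done with Pillow along these lines; the font files below are placeholders, and the actual 10 fonts (F1–F8 computer-like, F9–F10 hand-written-like) are listed in the paper [10]:

```python
from PIL import Image, ImageDraw, ImageFont

# Placeholder font files, not the study's actual font list.
FONTS = ["DejaVuSansMono.ttf", "Handlee-Regular.ttf"]

def render_expression(expr: str, font_path: str, out_path: str,
                      size: int = 24, padding: int = 10) -> None:
    """Render one OCL expression as black text on a white background."""
    font = ImageFont.truetype(font_path, size)
    # Measure the text first to size the canvas.
    probe = ImageDraw.Draw(Image.new("RGB", (1, 1)))
    left, top, right, bottom = probe.textbbox((0, 0), expr, font=font)
    img = Image.new("RGB", (right - left + 2 * padding,
                            bottom - top + 2 * padding), "white")
    ImageDraw.Draw(img).text((padding - left, padding - top), expr,
                             font=font, fill="black")
    img.save(out_path)

for i, font_file in enumerate(FONTS):
    render_expression("context Employee inv: self.age >= 18",
                      font_file, f"expr_f{i + 1}.png")
```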

Our experiments show that the default configuration of Tesseract already produces a significant percentage of correct expressions for computer fonts, around 60%. Nevertheless, our approach of empowering the OCR engine with repairing strategies has shown clear potential: these strategies improve recognition by 10%-20% while introducing only a negligible performance overhead (about +0.1 s).

Finally, we have observed that all strategies work better for computer-like fonts (F1–F8) than for hand-written fonts (F9–F10). Even in this harder case, our repairing algorithm obtains its largest improvement (24%) for the Handlee font (F10).

We plan to keep working on these and new strategies to achieve further improvements in Optical Character Recognition for Domain-Specific Languages!

References

[1] Jamshed Memon, Maira Sami, and Rizwan Ahmed Khan. 2020. Handwritten Optical Character Recognition (OCR): A Comprehensive Systematic Literature Review (SLR). arXiv:cs.CV/2001.00139

[2] Kandarp Khandwala and Philip J. Guo. 2018. Codemotion: Expanding the Design Space of Learner Interactions with Computer Programming Tutorial Videos. In Proceedings of the Fifth Annual ACM Conference on Learning at Scale. ACM, New York, NY, USA, 1–10. https://doi.org/10.1145/3231644.3231652

[3] Luca Ponzanelli, Gabriele Bavota, Andrea Mocci, Rocco Oliveto, Massimiliano Di Penta, Sonia Haiduc, Barbara Russo, and Michele Lanza. 2019. Automatic Identification and Classification of Software Development Video Tutorial Fragments. IEEE Transactions on Software Engineering 45, 5 (May 2019), 464–488. https://doi.org/10.1109/TSE.2017.2779479

[4] Shir Yadid and Eran Yahav. 2016. Extracting code from programming tutorial videos. In Proceedings of the 2016 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software – Onward! 2016. ACM Press, New York, New York, USA, 98–111. https://doi.org/10.1145/2986012.2986021

[5] Sergey I. Nikolenko. 2019. Synthetic Data for Deep Learning. arXiv:cs.LG/1909.11512

[6] Jordi Cabot and Martin Gogolla. 2012. Object Constraint Language (OCL): A Definitive Guide. In Formal Methods for Model-Driven Engineering – 12th International School on Formal Methods for the Design of Computer, Communication, and Software Systems, SFM 2012, Bertinoro, Italy, June 18-23, 2012. Advanced Lectures. 58–90. https://doi.org/10.1007/978-3-642-30982-3_3

[7] Object Management Group. 2015. Unified Modeling Language (UML) Specification. Version 2.5. OMG document formal/2015-03-01.

[8] Sebastian Erdweg, Tijs van der Storm, Markus Völter, Laurence Tratt, Remi Bosman, William R. Cook, Albert Gerritsen, Angelo Hulshout, Steven Kelly, Alex Loh, Gabriël D. P. Konat, Pedro J. Molina, Martin Palatnik, Risto Pohjonen, Eugen Schindler, Klemens Schindler, Riccardo Solmi, Vlad A. Vergu, Eelco Visser, Kevin van der Vlist, Guido Wachsmuth, and Jimi van der Woning. 2015. Evaluating and comparing language workbenches: Existing results and benchmarks for the future. Comput. Lang. Syst. Struct. 44 (2015), 24–47. https://doi.org/10.1016/j.cl.2015.08.007

[9] J. Noten, J. G. M. Mengerink, and A. Serebrenik. 2017. A Data Set of OCL Expressions on GitHub. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR). Buenos Aires, 531–534. https://doi.org/10.1109/MSR.2017.52

[10] Jorge Perianez-Pascual, Roberto Rodriguez-Echeverria, Loli Burgueño, and Jordi Cabot. 2020. Towards the Optical Character Recognition of DSLs. In Proceedings of the 13th ACM SIGPLAN International Conference on Software Language Engineering (SLE ’20), November 16–17, 2020, Virtual, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3426425.3426937
