In a context of ever-increasing volumes of data and knowledge, both in quantity and in diversity (Big Data), the main objective of SemLIS is to give users back power over their data. By users we mean any individual or group with a strong interest in some data and the need to exploit them in order to derive new knowledge and make decisions. That includes tasks such as search, authoring, data mining, and business intelligence. Those data can range from the personal data of an individual to the information systems of large companies, including project management within a team. We take a subjective view of “Big Data” in which the complexity lies not in efficiently performing a given task on a large volume of data (e.g., query evaluation), but in enabling users to perform tasks that could not be anticipated (e.g., query formulation). In that subjective view, “Big” only means an amount of data that is too large or too complex for users to grasp and analyze by hand or with simple means (e.g., spreadsheets).
That main objective of giving users back power over their data can be decomposed into five high-level objectives:
- AUTO: to make users autonomous and agile in the process of exploiting data and knowledge by avoiding intermediaries (e.g., database administrators);
- SEM: to facilitate the semantic representation and alignment of heterogeneous and multi-source data;
- FLEX: to provide flexibility by enabling out-of-schema data acquisition, and continuous evolution of the data schema;
- CON: to provide control and confidence in the information system by promoting transparency and predictability of system actions;
- COLL: to support the collaborative acquisition and verification of data and knowledge.
Those objectives are different facets of the same approach, which targets user guidance as a trade-off between full automation (aka. artificial intelligence) and no automation (aka. programming). We are aware that this set of objectives is ambitious, but we think we can address them because we do not target the hard problems of full automation, and because we now have an effective design pattern, ACN (Abstract Conceptual Navigation), to encapsulate an expressive formal language into data-user interaction and natural language.
Scientific foundations and former results
A distinctive aspect of our team is the application of formal methods coming from software engineering and theoretical computer science (formal languages and grammars, logics, type theory, declarative programming languages, theorem proving) to artificial intelligence tasks (knowledge representation and reasoning, data mining, user-data interaction). This is explained by the combination of a theoretical background shared by permanent members and a genuine interest in data and their users.
We briefly describe the scientific foundations of the team, organized by high-level research topics, along with references to a few former contributions in each topic.
Knowledge Representation and Querying
The team uses symbolic approaches, and in particular Semantic Web technologies. Indeed, these form an active research domain and provide W3C standards for concepts introduced by widely recognized knowledge representation formalisms, e.g., Datalog, description logics, and conceptual graphs. The Semantic Web defines languages for the representation of facts and rules (RDF, RDFS, OWL, SWRL), and for their querying (SPARQL). Moreover, the Semantic Web has an active community, both in academia and in industry. That research domain calls for competencies in formal languages (syntax and semantics), in logics, and in automated reasoning. We also study various kinds of logics, such as epistemic logics, dynamic logics, and deontic logics. In those topics, we have for instance contributed to the modular composition of logics, to the representation of updates, and to the representation of complex expressions in RDF.
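The core idea behind RDF and SPARQL can be illustrated with a toy example: facts are stored as subject-predicate-object triples, and queries are triple patterns whose variables get bound against the graph. The sketch below is only a minimal hand-rolled illustration with made-up data, not the team's tooling nor real SPARQL evaluation.

```python
# Toy triple store with a single-pattern query; variables start with "?".
# Illustrative sketch only -- a real system would evaluate SPARQL over an RDF store.
triples = [
    ("alice", "knows", "bob"),
    ("bob",   "knows", "carol"),
    ("alice", "type",  "Person"),
]

def match(pattern):
    """Return one variable binding per triple matching the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                if binding.setdefault(p, t) != t:  # same variable must bind consistently
                    break
            elif p != t:
                break
        else:
            results.append(binding)
    return results

print(match(("alice", "knows", "?x")))  # [{'?x': 'bob'}]
print(match(("?s", "knows", "?o")))
```

A full query engine would additionally join several such patterns (a SPARQL basic graph pattern), but the variable-binding principle is the same.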
Natural Language Processing
Here again, the team uses symbolic approaches. One task is to extract structured and semantic information from texts. The employed techniques are: a) categorial grammars, which associate syntactic/semantic types to words; b) Montague grammars, which combine grammars, lambda calculus, and logic; and c) sequential patterns. Those techniques can be used for the syntactic/semantic analysis of sentences, for Information Extraction (IE), and for defining Controlled Natural Languages (CNL). In those topics, we have for instance contributed to the learnability of pregroup grammars and their extension with option and iteration, to a CNL (SQUALL) for querying and updating RDF graphs, and to the discovery of linguistic patterns from texts.
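The Montague-style idea of composing word meanings via lambda terms can be sketched in a few lines: each word denotes a function, and the meaning of a sentence arises by function application. The toy lexicon below is an assumption for illustration, not the SQUALL grammar itself; logical formulas are built as plain strings.

```python
# Toy Montague-style semantic composition (illustrative sketch).
# Proper names and quantified noun phrases are generalized quantifiers:
# functions from predicates (e -> t) to truth values / formulas.
john   = lambda p: p("john")
every  = lambda n: lambda p: f"forall x. ({n('x')} -> {p('x')})"
man    = lambda x: f"man({x})"
sleeps = lambda x: f"sleep({x})"

print(john(sleeps))        # sleep(john)
print(every(man)(sleeps))  # forall x. (man(x) -> sleep(x))
```

In a real categorial-grammar pipeline, the syntactic types of the words determine which of these applications are licensed; here the composition order is fixed by hand.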
Symbolic Data Mining
The team has expertise in the design and application of symbolic data mining algorithms, in particular for sequential patterns and their application to texts. It also has expertise in learning the grammar of natural languages from a structured corpus. Moreover, the SemLIS team was scientifically founded on Formal Concept Analysis (FCA). It has produced FCA-based contributions to data mining and machine learning, as well as to data exploration.
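Formal Concept Analysis starts from a binary context (objects x attributes) and derives formal concepts: maximal pairs of an object set and the attribute set they all share. A naive but self-contained sketch, on a made-up three-object context chosen for illustration, can enumerate them by closing every subset of objects:

```python
# Naive Formal Concept Analysis on a toy context (illustrative sketch;
# real FCA algorithms avoid enumerating all object subsets).
from itertools import combinations

context = {
    "duck": {"flies", "swims"},
    "swan": {"flies", "swims"},
    "frog": {"swims", "green"},
}
objects = set(context)

def intent(objs):
    """Attributes shared by all objects in objs (all attributes if objs is empty)."""
    sets = [context[o] for o in objs]
    return set.intersection(*sets) if sets else set.union(*context.values())

def extent(attrs):
    """Objects possessing every attribute in attrs."""
    return {o for o in objects if attrs <= context[o]}

# A formal concept is a pair (extent(intent(A)), intent(A)) for some object set A.
concepts = set()
for r in range(len(objects) + 1):
    for objs in combinations(sorted(objects), r):
        i = intent(set(objs))
        concepts.add((frozenset(extent(i)), frozenset(i)))

for e, i in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(e), "->", sorted(i))
```

On this context the closure yields four concepts, ordered by extent inclusion into a small concept lattice, from the empty extent up to the concept grouping all objects under the shared attribute "swims".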
User-Data Interaction
Because of the importance that we give to user-data interaction, the team has invested in techniques for structuring and reasoning on those interactions. We can refer, in particular, to faceted search (often used in e-commerce platforms), On-Line Analytical Processing (OLAP, often used in business intelligence), Geographical Information Systems (GIS), and multi-agent systems. In those topics, we have for instance contributed to the exploration of geographical data, to the discovery of functional dependencies and association rules with OLAP cubes, and to the extension of faceted search to RDF graphs.
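The interaction pattern behind faceted search can be conveyed with a minimal sketch: the user narrows a selection by picking facet values, and the interface reports which values (with counts) remain available, so that no selection ever leads to an empty result unexpectedly. The item catalog and facet names below are invented for illustration.

```python
# Toy faceted search over a small catalog (illustrative sketch).
items = [
    {"name": "A", "color": "red",  "size": "S"},
    {"name": "B", "color": "red",  "size": "M"},
    {"name": "C", "color": "blue", "size": "M"},
]

def select(items, **facets):
    """Keep items matching every chosen facet value."""
    return [it for it in items if all(it.get(f) == v for f, v in facets.items())]

def facet_counts(items, facet):
    """Count the values a facet still takes in the current selection."""
    counts = {}
    for it in items:
        counts[it[facet]] = counts.get(it[facet], 0) + 1
    return counts

red = select(items, color="red")
print([it["name"] for it in red])  # ['A', 'B']
print(facet_counts(red, "size"))   # {'S': 1, 'M': 1}
```

Extending this principle to RDF graphs, as mentioned above, amounts to letting facets range over graph properties and paths rather than flat attributes.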
Although the above four topics correspond to traditionally distinct research domains and communities, they are often found in combination in today’s research challenges and conferences. Many of our contributions actually lie at the intersection of several topics: e.g., the application of symbolic data mining to linguistic data such as texts, the interactive exploration and filtering of data mining results, the representation and querying of natural language resources such as lexicalized grammars, or the combination of a query language, natural language generation, and user-data interaction to help users explore the Semantic Web. We believe that all those topics are essential, and need to be combined, in order to achieve our objectives.