Encoding Class GrammarBasedInformationExtraction

Class Tree
Description SCENARIO
Recognizing and extracting meaningful information from unstructured Web documents is an important problem in information and knowledge management.
In the recent literature a number of approaches for information extraction from unstructured documents have been proposed.
Among them, the HiLex system [Ruffolo et al JELIA'06], based on ASP, is receiving quite some attention also in industry, and its industrial exploitation is the purpose of a joint-venture between an italian and US company from Chicago.
A strong point of HiLex is that its extraction method combines syntactic and SEMANTIC information, through the usage of domain ontologies.
A preprocessor transforms the input documents in ASP facts; extraction rules are translated into ASP; and information extraction amounts to reasoning on an ASP program, which is executed by DLV.

The present benchmark is "distilled" from a simple HiLex application.

PROBLEM SPECIFICATION
We are given the following context free grammar, which specifies arithmetic equations:

<equation> -> <expression> <equal> <sign_1> <number>
<expression> -> <sign_1> <term> <term_1>
<term_1> -> <sign> <term> <term_1> | epsilon
<term> -> <number> | <op_bracket> <expression> <cl_bracket>
<sign> -> <plus> | <minus>
<sign_1> -> <sign> | epsilon
<number> -> <digit> <number_1>
<number_1> -> <digit> <number_1> | epsilon
<plus> -> "+"
<minus> -> "-"
<equal> -> "="
<op_bracket> -> "("
<cl_bracket> -> ")"
<digit> -> "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

The nonterminals, especially <equation>, <expression>, <term>, <number>, have an associated value, usually representing a numerical value, and in the case of <equation> a binary value.

Given a string (sequence of terminals), the problem is determining whether it belongs to the language defined by this grammar (starting from <equation>) and if so, whether its associated value is "true". So, there should be an answer set exactly if the input string is a valid equation and the equation holds.

For instance, given the string "-((-2+5)+1)=-4", there should be an answer set, while given either of the strings "--2+5+1=-4" or "1+1=5", there should not be an answer set.

The instances for this problem contain a fact for each character of the string. The first argument is a string of length 1, and the second is its position (starting from 1). The instance "-((-2+5)+1)=-4" from above would be represented by the following facts:

char("-",1).
char("(",2).
char("(",3).
char("-",4).
char("2",5).
char("+",6).
char("5",7).
char(")",8).
char("+",9).
char("1",10).
char(")",11).
char("=",12).
char("-",13).
char("4",14).

In the examples, 1000000 (one million) is an upper bound for the numerical values that may occur at any point (so also each intermediate result may be at most one million). Also the length of the input string will not be longer than one million characters.

As for output, we only require that the predicate sol (without arguments) occurs in the answer set.

Author: Marco Manna
Encodings
1 - 2 of 2