•
We present an algorithm, FLASHRELATE, that syn-
thesizes FLARE programs given a small number of
user-provided positive and negative examples (see §4 ).
FLASHRELATE’s effectiveness (in speed and number of
required examples) is dictated by its efficient constraint
search.
•
We empirically evaluate the expressiveness of the FLARE
and the effectiveness of FLASHRELATE against a set
of 43 real-world spreadsheets drawn from the EUSES
corpus and from Excel user help forums [11, 16]. We
show that the FLASHRELATE algorithm is able to syn-
thesize correct FLARE programs from examples in all but
one case. Only a small number of examples are required
(see §5 ). FLASHRELATE rarely takes longer than 2 sec-
onds.
2. Related Work
Much recent research considers the problem of extract-
ing structured data from unstructured or semi-structured
sources, including documents on the web (e.g, [4, 9, 21, 22]).
Extracting Data from the Web Another important related
body of work focuses on extracting relational data from data
on the web. SXPath [22] is a query language, like FLARE,
that uses a combination of spatial relations and path expres-
sions to extract data from complex web documents. SXPath
uses intuitive spatial relations in describing patterns to avoid
queries that involve complex, non-intuitive deep path ex-
pressions. SXPath, like FLARE, includes spatial primitives
in its queries, but does not attempt to synthesize programs
in the query language from examples, as we do. SILA [21]
defines a spatial DOM abstraction (PDOM) and proposes
an algorithm for automatically extracting data records from
deep web pages based on the PDOM. While, like FLASHRE-
LATE, SILA attempts to extract records from spatially struc-
tured data, it does so algorithmically, and not by defining a
domain-specific query language. For overviews of the range
of approaches taken to web data extraction, see Ferrera et
al. [9] and Cafarella et al. [4]
Extracting Data from Spreadsheets The research most
closely related to FLASHRELATE is the concurrently-
developed SENBAZURU project, which, like FLASHRE-
LATE, seeks to extract relational data from spreadsheets [5,
6]. While the project goals are similar, the approach taken
by FLASHRELATE is very different. SENBAZURU attempts
to automatically infer hierarchical structure in spreadsheets
by creating a classifier that identifies data frames in the docu-
ment and another classifier that infers the intended hierarchy
based on a set of predefined features. In contrast, FLASHRE-
LATE uses positive and negative output examples to synthe-
size a program in a domain-specific language of our own
design. FLASHRELATE can be used to perform arbitrary ex-
traction tasks from arbitrary spreadsheets, provided that reg-
ular syntactic and spatial structure is present.
Cunha et al. [7] also consider extracting relational data
from spreadsheets, but their focus is on recovering the true
relational schema from the spreadsheet data. We believe that
a user might have in mind many different task-dependent
schemas for a single spreadsheet, as the example in the
introduction illustrates. Instead, FLASHRELATE crafts an
extraction program that returns precisely those tuples that
the user wants based a set of user-supplied examples.
Query Synthesis by Examples The view synthesis prob-
lem [8, 26] aims to find the most succinct and accurate query
for a given database view. There are two key differences with
our work: (i) View synthesis techniques infer a relation from
multiple single-dimensional relational tables, while we in-
fer a relation from a single, two-dimensional semi-structured
spreadsheet. Such spreadsheets are often used to encode
multiple dimensions in ad-hoc ways. (ii) View synthesis
techniques infer a relation from a large representative exam-
ple view, while we infer a transformation from a set of few
example rows, an important usability aspect for end-users.
Programming by Examples The area of programming by
examples [18] (PBE) is gaining renewed interest [14] be-
cause of its revolutionary potential to enhance productiv-
ity of millions of end-users. Gulwani et al. have developed
programming-by-example techniques for automating repet-
itive data manipulation tasks related to structured spread-
sheet tables [15]. These include syntactic string transforma-
tions [13], semantic string transformations [25], and table
layout transformations [16]. Of these, the most closely re-
lated work is that of [16]. The following are some key dif-
ferences with our work: (i) We address a different class of
spreadsheet tasks, namely transforming semi-structured data
into structured relational data. (This facilitates application of
prior work on manipulating structured relational data using
input-output examples.) For instance, [16] cannot handle any
of the transformation tasks associated with our new bench-
mark examples. (ii) The synthesis techniques used are com-
pletely different. Prior work uses a class of techniques called
version-space algebras, while we perform heuristic search.
(iii) The user interaction is quite different. Prior work takes
as input multiple input-output examples, while our technique
takes as input multiple positive/negative examples of tuples
in the desired output table.
Quicksilver [19] is another recent PBE technology for
structured relational data. It synthesizes relational algebra
queries over strictly relational tables, while we focus on a
different class of spreadsheet tasks, namely extracting rela-
tional tuples from semi-structured spreadsheets. QuickSil-
ver cannot handle any of the transformation tasks associated
with our benchmark examples.
Manipulation of Semi-Structured Data The PADS project
simplifies ad hoc data-processing tasks for programmers by
developing domain-specific languages for describing data
formats in text files and learning algorithms for inferring
FLASHRELATE 4 2014/4/23