FLASHRELATE: Extracting Relational Data from Semi-Structured

FLASHRELATE: Extracting Relational Data from

Semi-Structured Spreadsheets Using Examples

Daniel W. Barowy

University of Massachusetts Amherst

dba[email protected]

Sumit Gulwani Ted Hart

Benjamin Zorn

Microsoft Research

{sumitg, ted.hart, ben.zorn}@microsoft.com

Abstract

Spreadsheets store a tremendous amount of important data.

One reason spreadsheets are so successful is that they are

both easy to use and allow users great expressiveness in

storing and manipulating their data. This ﬂexibility comes

at a price, as presentation elements are often combined with

the underlying data model. As a result, many spreadsheets

contain data in ad-hoc formats. These formats complicate

the use of traditional relational tools which require data in

a normalized form. Normalizing data from these formats is

often tedious or requires programming, and often, a user may

prefer the original presentation.

We describe an approach that allows users to easily ex-

tract structured data from spreadsheets without program-

ming. We make two contributions. First, we describe a novel

domain speciﬁc language called FLARE that extends tra-

ditional regular expressions with spatial constraints. Sec-

ond, we describe an algorithm called FLASHRELATE that

can synthesize FLARE programs from user-provided positive

and negative examples. Using 43 benchmarks drawn both

from a standard spreadsheet corpus and from Excel user-help

forums, we demonstrate that correct extraction programs can

be synthesized quickly from a small number of examples.

Our approach generalizes to many data-cleaning tasks on

semi-structured spreadsheets.

Categories and Subject Descriptors CR-number [subcat-

egory]: third-level

Keywords Program Synthesis, Domain Speciﬁc Languages,

Heuristic Search, Spreadsheets

1. Introduction

There are an estimated 500 million Microsoft Excel users

worldwide [24]. It is thus not surprising that a tremendous

amount of important data is stored in spreadsheets.

As with other document types such as text ﬁles and web

pages, spreadsheets combine their data model and view. This

ﬂexibility gives the spreadsheet creator a large degree of

freedom in encoding their data. It also complicates the use

of powerful database tools (e.g., relational queries) which

expect data in a particular form. Although spreadsheets are

tabular and ostensibly tables, end-users may organize their

data in any two-dimensional form that they ﬁnd convenient.

This problem is widespread. Recent research suggests

that the vast majority of spreadsheets available on the web

cannot be trivially converted to database relations [6]. We

argue that the cause is due to the multidimensionality of data.

Representing multidimensional data in a two-dimensional

spreadsheet presents a fundamental problem for users, since

additional dimensions need to be projected into a 2D grid.

Spreadsheet authors often solve this problem by crafting

clever encodings of their data.

While expert users might recognize when database tech-

nologies might be more appropriate, they do so because

these technologies facilitate automatic processing. In con-

trast, spreadsheets allow compact and intuitive visual rep-

resentations of data better suited for human understanding.

Since these representations vary from spreadsheet to spread-

sheet, each unique encoding must be decoded if they are to

be used with more powerful data tools.

The conﬂating of presentation information with data is

commonplace in many domains. The recognition that non-

experts are unlikely to embrace data management tools de-

signed for automatic processing has lead to the development

of numerous domain-speciﬁc technologies for data extrac-

tion. Scripting languages like Perl, Awk, and Python have

been designed to support string processing in text ﬁles. Web

technologies like XQuery, HTQL, XSLT, and SXPath can be

used to extract data from webpages. There are no such tools

for spreadsheets.

Example To make our contributions more concrete, we re-

fer to the example showing in Fig. 1(a), a real spreadsheet

taken from the EUSES corpus [11]. The spreadsheet shows

timber harvests by country and year. It also packs data about

a particular country into value/year pairs within a row, ap-

pending comments to the far right. The original author prob-

ably structured the data in this form to avoid having the data

in a long thin column, which would be harder to read.

Consider the steps that a programmer must undertake to

compute the average harvest in 1950 from this representa-

FLASHRELATE 1 2014/4/23

A B C D E . . . R

1 value year value year Comments

2 Albania 1,000 1950 930 1981 FRA 1

3 Austria 3,139 1951 3,177 1955 FRA 3

4 Belgium 541 1947 601 1950

5 Bulgaria 2,964 1947 3,259 1958 FRA 1

6 Czech . . . 2,416 1950 2,503 1960 NC

. . .

(a)

A B C D

1 Albania 1,000 1950 FRA 1

2 Albania 930 1981 FRA 1

. . .

5 Austria 3,139 1951 FRA 3

6 Austria 3,177 1955 FRA 3

. . .

9 Belgium 541 1947

10 Belgium 601 1950

. . .

(b)

Figure 1: (a) A semi-structured spreadsheet with two example tuples highlighted. The ﬁrst tuple (red) represents the timber

harvest (per 1000 hectares) for Albania in 1950. The second tuple (blue) represents the timber harvest for Austria in 1950. (b)

An extracted relational table with the same two tuples highlighted as in Fig. 1a

tion. The spreadsheet author’s encoding, while perhaps eas-

ier to read, makes this task a challenge. The ﬁrst thought

of such a programmer might be to convert the data into a

form better suited to performing the computation, like the

relational table shown in Fig. 1(b). One approach would be

to use a scripting language like Perl or Visual Basic to de-

ﬁne regular expressions that would match the elements in

each column. For example, one could match country names

with an alphanumeric pattern, years with 4-digit patterns,

etc. Having done this, matches would be spatially collated

to form tuples, and tuples collated to form a table. We show

one such program in Fig. ??.

Problem Statement The problem that we address in this

paper is how to enable ordinary spreadsheet users, who do

not have programming skills, to convert ad hoc data for-

mats into relational ones. The strategy we use to achieve

this goal has two parts. 1) We present a domain-speciﬁc lan-

guage (DSL) called FLARE for extracting relevant data from

spreadsheets in a relational format. 2) We also present an

algorithm called FLASHRELATE that automatically gener-

ates FLARE programs from a small number of user-provided

positive and negative examples of tuples in the desired out-

put table, greatly simplifying the process of automated data

extraction.

FLARE Query Language FLARE describes the regular ge-

ometric structure of ad-hoc encodings to map spreadsheet

data into relational tables. The design of FLARE is inspired

by scripting languages with regular expression capabilities,

which have enabled developers with to extract relational data

from text ﬁles, such as server logs. Spreadsheets represent

another major source of semi-structured data. Without sup-

porting code, regular expressions and other string-matching

tools are not expressive enough to capture relational in-

formation encoded in spreadsheets. These structures, such

as hierarchical data, become complex geometric structures

when projected into a two-dimensional grid.

FLARE converts ad hoc stuctures into relational ones

so that relational tools can use them. FLARE declaratively

speciﬁes the structure of ad-hoc encodings in the form of

constraints. Data that matches these constraints are automat-

ically converted into relational tables.

FLARE has two kinds of constraints. Constraints over

the text within a spreadsheet cell are referred to as cell

constraints and are composed of regular expressions. Cell

constraints represent valid classes of values in a relational

table, i.e., a relational column. Constraints between cells

are referred to as spatial constraints and are composed of

geometric relationships. Spatial constraints represent valid

geometric relationships between values that belong in the

same relational tuple.

Taken together, these two kinds of constraints form a di-

rected constraint graph over the spreadsheet. Vertices consist

of cell constraints while edges consist of spatial constraints.

The cells matched by a traversal of this graph produce a set

of relational tuples in the desired output table. We show a

sample FLARE query and corresponding constraint graph in

Fig. 3. Intuitively, one can think of this graph as an invariant

geometric structure that, when translated over a spreadsheet,

indicates which cells form a relational tuple.

FLASHRELATE Synthesis Algorithm Programmatic so-

lutions to data extraction suffer two key limitations. First,

the expertise required to use these tools is often particular

to speciﬁc document types. Second, and more signiﬁcantly,

they require knowledge of programming. The ﬁrst aspect

creates challenges even for programmers, while the second

aspect puts these solutions out of reach of the vast majority

of end users. As a result, users are either unable to leverage

access to rich data or have to resort to manual copy-and-

paste, which is both time-consuming and error-prone.

FLASHRELATE is a complementary technology that

bridges the expertise gap between ordinary end-users and

sophisticated new query languages like FLARE. FLASHRE-

LATE infers the correct set of constraints from a small num-

ber of examples given by the user and outputs an appropriate

FLARE program.

Example 1: Returning to our example in Fig. 1, we show

how a FLARE program can concisely extract the neces-

sary relational table. Recall that the user would like to

FLASHRELATE 2 2014/4/23

Fun ctio n Extract () As Colle ctio n

...

Set T uples = New Col lecti on ’ R e s ults

rYear . Pattern = " ^1 9[0 -9]{2} $ " ’ Y ear P att .

rValue . Patt ern = " ^[0 -9]+ $" ’ V a lue P att .

’ S e arch f or T u ples

For Each ws In Wor kshee ts

For Each cell In ws. Use dRang e

x = cell . Column

y = cell . Row

x_rt = x + 1

If rYear . Test ( cell . Value ) _

And r Value . Test (_

ws . Cells (y , x_rt ). Value ) Then

Dim tup leCoor ds

tupl eCoor ds = Array ( ws . Index , x , _

y , x_rt , y )

Tuples . Add ( cel lCoor ds )

End If

Ext ract = Tu ples

End Functio n

Figure 2: A Visual Basic program that performs the desired

extraction in Example 1. Variable declarations were omitted

for space reasons. The equivalent FLARE program, shown

in Fig. 3a, declaratively speciﬁes the extraction and is much

simpler.

compute the average timber harvest for 1950. The FLARE

program shown in Figure 3 performs this extraction. Once

the data is in the desired form, the desired result can be

computed trivially with the following SQL query: SELECT

AVG(column 1) FROM FlareTable WHERE column 2 =

1950.

Despite the difﬁculty of computing the result from the

original structure, a typical person would have no trouble

understanding the spreadsheet. Numerous geometric cues

guide a person’s eye to the right location. For example,

timber harvest values are always located under a heading

titled “value.” Year values are always located to the right

of the timber value. Country names are always located to

the far left. Although other geometric information may be

used to perform the same task, these three invariants are

sufﬁcient to extract the desired tuple. FLARE makes use of

these intuitive geometric concepts to concisely express the

desired extraction.

In Figure 3(a), Node constraints identify the two columns

in the desired relation, (1) the value, and (2) the year, labeled

with column IDs 1 and 2, respectively. Node constraints

are based on regular expressions. The ﬁrst Node constraint

includes an extra contextual constraint, called an anchor,

that requires the presence of a second, contextual cell. In this

program, that additional cell must contain the string “value”

somewhere above Node 1 in the input spreadsheet.

Note that Node 2 is a strict superset of Node 1. How then,

are “value” cells to be distinguished from “year” cells? Edge

Node(1, “^[0-9]+$”, Anchor(“value”, Vert(*), Horiz(0)))

Node(2, “^[0-9]+$”, ⊥)

Edge (1, 2, Vert(0), Horiz(1), Select(All, All))

(a)

2 1

(b)

Figure 3: (a) FLARE program for ﬁrst example extraction

task with (b) schematic illustration. Node constraints are

shown as dots, Edge constraints are shown as solid arrows,

and Anchor constraints are shown with a dashed arrow and

anchor symbol. Edge s and Anchors of non-constant length

are labeled with a Kleene star. Node numbers correspond to

attribute IDs in the desired relational tuple.

Node(3, “^[a-zA-Z ]+$”, ⊥)

Node(4, “^[a-zA-Z ]+$”, ⊥)

Edge (3, 1, Vert(0), Horiz(*), Select(All, All))

Edge (2, 4, Vert(0), Horiz(*), Select(All, All))

(a)

2 1 3

(b)

Figure 4: (a) A set of additional constraints added to Fig-

ure 3a; (b) schematic illustration.

constraints provide the additional context required by de-

scribing the spatial relationships between Nodes. Year values

must be one cell to the right of harvest values.

Example 2: Another task might be to ﬁnd comments for

countries reporting harvests after 1960. This new extraction,

displayed as a set of additional constraints appended to our

original program, is shown in Figure 4. The extracted table

corresponds to the one shown in Fig. 1b. The task can be

completed by running the following SQL query against the

extracted table: SELECT DISTINCT column 3,column 4

FROM FlareTable WHERE column 2 > 1960.

Contributions This paper makes the following contribu-

tions:

•

We describe a domain-speciﬁc query language, FLARE,

that allows extraction of relational data from two-dimensional

semi-structured spreadsheets (see §3 ). FLARE is the ﬁrst

pattern-based spatial query language explicitly designed

to extract relational data from spreadsheets.

FLASHRELATE 3 2014/4/23

•

We present an algorithm, FLASHRELATE, that syn-

thesizes FLARE programs given a small number of

user-provided positive and negative examples (see §4 ).

FLASHRELATE’s effectiveness (in speed and number of

required examples) is dictated by its efﬁcient constraint

search.

•

We empirically evaluate the expressiveness of the FLARE

and the effectiveness of FLASHRELATE against a set

of 43 real-world spreadsheets drawn from the EUSES

corpus and from Excel user help forums [11, 16]. We

show that the FLASHRELATE algorithm is able to syn-

thesize correct FLARE programs from examples in all but

one case. Only a small number of examples are required

(see §5 ). FLASHRELATE rarely takes longer than 2 sec-

onds.

2. Related Work

Much recent research considers the problem of extract-

ing structured data from unstructured or semi-structured

sources, including documents on the web (e.g, [4, 9, 21, 22]).

Extracting Data from the Web Another important related

body of work focuses on extracting relational data from data

on the web. SXPath [22] is a query language, like FLARE,

that uses a combination of spatial relations and path expres-

sions to extract data from complex web documents. SXPath

uses intuitive spatial relations in describing patterns to avoid

queries that involve complex, non-intuitive deep path ex-

pressions. SXPath, like FLARE, includes spatial primitives

in its queries, but does not attempt to synthesize programs

in the query language from examples, as we do. SILA [21]

deﬁnes a spatial DOM abstraction (PDOM) and proposes

an algorithm for automatically extracting data records from

deep web pages based on the PDOM. While, like FLASHRE-

LATE, SILA attempts to extract records from spatially struc-

tured data, it does so algorithmically, and not by deﬁning a

domain-speciﬁc query language. For overviews of the range

of approaches taken to web data extraction, see Ferrera et

al. [9] and Cafarella et al. [4]

Extracting Data from Spreadsheets The research most

closely related to FLASHRELATE is the concurrently-

developed SENBAZURU project, which, like FLASHRE-

LATE, seeks to extract relational data from spreadsheets [5,

6]. While the project goals are similar, the approach taken

by FLASHRELATE is very different. SENBAZURU attempts

to automatically infer hierarchical structure in spreadsheets

by creating a classiﬁer that identiﬁes data frames in the docu-

ment and another classiﬁer that infers the intended hierarchy

based on a set of predeﬁned features. In contrast, FLASHRE-

LATE uses positive and negative output examples to synthe-

size a program in a domain-speciﬁc language of our own

design. FLASHRELATE can be used to perform arbitrary ex-

traction tasks from arbitrary spreadsheets, provided that reg-

ular syntactic and spatial structure is present.

Cunha et al. [7] also consider extracting relational data

from spreadsheets, but their focus is on recovering the true

relational schema from the spreadsheet data. We believe that

a user might have in mind many different task-dependent

schemas for a single spreadsheet, as the example in the

introduction illustrates. Instead, FLASHRELATE crafts an

extraction program that returns precisely those tuples that

the user wants based a set of user-supplied examples.

Query Synthesis by Examples The view synthesis prob-

lem [8, 26] aims to ﬁnd the most succinct and accurate query

for a given database view. There are two key differences with

our work: (i) View synthesis techniques infer a relation from

multiple single-dimensional relational tables, while we in-

fer a relation from a single, two-dimensional semi-structured

spreadsheet. Such spreadsheets are often used to encode

multiple dimensions in ad-hoc ways. (ii) View synthesis

techniques infer a relation from a large representative exam-

ple view, while we infer a transformation from a set of few

example rows, an important usability aspect for end-users.

Programming by Examples The area of programming by

examples [18] (PBE) is gaining renewed interest [14] be-

cause of its revolutionary potential to enhance productiv-

ity of millions of end-users. Gulwani et al. have developed

programming-by-example techniques for automating repet-

itive data manipulation tasks related to structured spread-

sheet tables [15]. These include syntactic string transforma-

tions [13], semantic string transformations [25], and table

layout transformations [16]. Of these, the most closely re-

lated work is that of [16]. The following are some key dif-

ferences with our work: (i) We address a different class of

spreadsheet tasks, namely transforming semi-structured data

into structured relational data. (This facilitates application of

prior work on manipulating structured relational data using

input-output examples.) For instance, [16] cannot handle any

of the transformation tasks associated with our new bench-

mark examples. (ii) The synthesis techniques used are com-

pletely different. Prior work uses a class of techniques called

version-space algebras, while we perform heuristic search.

(iii) The user interaction is quite different. Prior work takes

as input multiple input-output examples, while our technique

takes as input multiple positive/negative examples of tuples

in the desired output table.

Quicksilver [19] is another recent PBE technology for

structured relational data. It synthesizes relational algebra

queries over strictly relational tables, while we focus on a

different class of spreadsheet tasks, namely extracting rela-

tional tuples from semi-structured spreadsheets. QuickSil-

ver cannot handle any of the transformation tasks associated

with our benchmark examples.

Manipulation of Semi-Structured Data The PADS project

simpliﬁes ad hoc data-processing tasks for programmers by

developing domain-speciﬁc languages for describing data

formats in text ﬁles and learning algorithms for inferring

FLASHRELATE 4 2014/4/23

such formats using annotations [10]. The learned format can

then be used by programmers to implement custom data-

analysis tools. PADS focuses on parsing and manipulating

semi-structured data in text ﬁles or log ﬁles while we focus

on semi-structured data in spreadsheets.

Learning Theory Angluin presented a classical algorithm

for learning regular expressions or ﬁnite state automatas

from positive and negative examples [2]—an equivalence

oracle produces a counterexample that exhibits disequiva-

lence of the current hypothesis with the intended one, and

a membership oracle labels that counterexample as either

positive or negative. Our work learns a more sophisticated

concept, namely a two-dimensional query involving regular

expressions from positive/negative examples. In our case, it

is the end-user who plays the role of both equivalence and

membership oracles, providing a positive or negative tuple

row(s) in each interaction.

Two-Dimensional Grammars The formal methods com-

munity has done a lot of work on deﬁning picture gram-

mars [12, 23], which extend classical grammars for gen-

erating strings into two-dimensional space, using concepts

like two-dimensional regular expressions [3, 20] and au-

tomata [17]. These works focus on theoretical analysis

of two-dimensional grammars, speciﬁcally decidability and

complexity of classical problems for these novel languages.

In contrast, a FLARE program is speciﬁcally designed for

matching relational tuples. Furthermore, we address the

problem of synthesizing the desired program from few posi-

tive/negative examples.

Header Inference [1] describes a system that automati-

cally infers header information in spreadsheets by exploiting

special layout or formatting attributes, such as sub-headers,

footers, ﬁller cells (blank cells or cells with some special

characters to aid visual readability of table content). Our

work also leverages the presence of such layout or format-

ting constraints in spreadsheets; however our use of this in-

formation is not heuristic in nature—users provide examples

for tuple extraction and we learn a speciﬁc pattern-based log-

ical query for accomplishing the intended task.

3. Flare Language

The goal of a FLARE program is to transform a semi-

structured spreadsheet I into an n-ary relational table. The

syntax and formal semantics of FLARE are shown in Fig. 5.

A spreadsheet I is a two-dimensional collection of strings.

A cell c is a pair of x and y coordinates that can be used

to index I (as in I[c]). We use the term Cells(I) to de-

note all cells in the used range of the spreadsheet I. An

n-ary relational table is a set of n-ary tuples of strings.

Each such tuple can also be represented as a map from tu-

ple attribute indices {1, . . . , n} to spreadsheet strings (as in

{1 7→ s

, 2 7→ s

, . . . , n 7→ s

}).

A FLARE program ρ consists of a set of n nodes and a

set of directed edges that form a tree shape. These nodes and

edges are labeled with descriptions that form the criteria for

extracting relational tuples from the input sheet. Each node

in P corresponds to a unique tuple attribute index i from the

output relational table. It is associated with a cell constraint

α (denoted by CellConstraint(i)) that is a Boolean con-

straint over spreadsheet cells. Each directed edge in P cor-

responds to an ordered pair of tuple attribute indices j and k

from the output relational table. Each such edge is associated

with two kinds of constraints: (a) a spatial constraint β (de-

noted by SpatialConstraint(j, k)), which is a Boolean

constraint over an ordered pair of spreadsheet cells and

checks horizontal/vertical orientation of the two cells. (b) a

select constraint γ (denoted by SelectConstraint(j, k)),

which ﬁlters a set of cells with respect to another cell based

on some distance related tags. We discuss the precise syntax

and semantics of these different kinds of constraints later in

this section. We use the term Root(ρ) to denote the tuple

attribute index corresponding to the root node in the tree

structure formed by the nodes and edges in ρ.

In order to deﬁne a constructive semantics for a FLARE

program ρ, we introduce a recursive function F (j, k, c) that

takes as input an edge identiﬁer (i.e., a pair of tuple attribute

indices j and k) and a cell c and returns a set of m-tuples,

where m is the number of nodes reachable from the corre-

sponding edge.

F (j, k, c) is computed in three steps: (i) Compute the set

of all cells that simultaneously satisfy both the spatial

relationship to c with SpatialConstraint(j, k) and the

cell pattern CellConstraint(k). (ii) Filter C

to that subset

that satisﬁes the SelectConstraint(j, k) relationship

with c. (iii) For each c

∈ C

, compute the cross product

of singleton c

with the result of the recursive invocation of

F along each outgoing edge from nodes with tuple attribute

index k and cell c

(i.e., F (k, k

, c

), where k

belongs to

the set of children of node k (denoted by Children(ρ, k))),

and return their union.

The execution of a program ρ on an input spreadsheet

I proceeds as follows: (i) Compute the set C of all cells

that satisfy the cell constraint CellConstraint(`), where

` is the tuple attribute index associated with the root node of

program- ρ. (ii) For each c ∈ C, compute the cross product

of singleton c with the result of recursive invocations of F

along the outgoing edges from the root node and cell c (i.e.,

F (`, k

, c)), and return their union.

A cell constraint Cell(r, ) consists of a regular expres-

sion r, namely a Boolean constraint over strings, and an an-

chor constraint , namely a Boolean constraint over cells.

An anchor constraint Anchor(r,β), which consists of a reg-

ular expression r and a spatial constraint β, asserts that there

exists a cell c

whose content matches r and that is related

to the argument cell using the spatial constraint β. Anchor

constraints can be thought of as a special case of the spatial

FLASHRELATE 5 2014/4/23

Program ρ := {N

}

∪ {E

}

Node N := Node(j, α)

Edge E := Edge(j, k, β, γ)

Cell Constraint α := Cell(r, )

Spatial Constraint β := Spatial(v,h)

Vertical Constraint v := Vert(q)



Horiz. Constraint h := Horiz(q)



Quantity q := . . .



-2



-1



. . .



- ∗



∗

Select Constraint γ := Select(T

, T

)

Type T := All



NearX



NearY



FarX



FarY

Anchor Constraint := Anchor(r,β)



⊥

(a)

Cell: (N, N)

Spreadsheet: Cell → String

Program: Spreadsheet → n-tuple set

Cell/Anchor Constraint: Cell → Bool

Spatial/Vert/Horiz Constraint: (Cell, Cell) → Bool

Select Constraint: (Cell, Cell set) → Cell set

(b)

Node(j, r, ) ≡ Node(j, Cell(r, ))

Edge(j, k, v, h, γ) ≡ Edge(j, k, Spatial(v, h), γ)

Anchor(r, v, h) ≡ Anchor(r, Spatial(v, h))

All ≡ Select(All, All)

(c)

[[ρ]] I = {{(` 7→ I[c])} ∪ σ

∪ . . . ∪ σ

| c ∈ C, σ

∈ F (`, k

, c)}

where C = {c | [[α]] c, c ∈ Cells(I)}, ` = Root(ρ)

α = CellConstraint(`), {k

, . . . , k

} = Children(ρ, `)

F (j, k, c) = {{(k 7→ I[c

])} ∪ σ

∪ . . . ∪ σ

| c

∈ C

, σ

∈ F (k, k

, c

)}

where C

= [[γ]] (c, C

), C

= {c

| [[β]] (c, c

) ∧ [[α]] c

α = CellConstraint(k), β = SpatialConstraint(j, k)

γ = SelectConstraint(j, k), {k

, . . . , k

} = Children(ρ, k)

[[Cell(r, )]] (I, c) = [[r]] I[c] ∧ [[ ]] (I, c)

[[Spatial(v, h)]] (c, c

) = [[v]] (c, c

) ∧ [[h]] (c, c

)

[[Horiz(q)]] ( c, c

) = (fst(c) − fst(c

)) ∈ Range(q)

[[Vert(q)]] ( c, c

) = (snd(c) − snd(c

)) ∈ Range(q)

where Range(-∗) = {−1, −2, −3, . . .}

Range(∗) = {1, 2, 3, . . .}

Range(i) = {i}

[[>]] (c, c

) = true

[[Select(T

, T

)]] (c, C) = [[T

]] (c, [[T

]] (c, C))

[[All]] (c, C) = C

[[NearX]] (c, C) = argmin

∈C

|fst(c) − fst(c

[[Anchor(r,β)]] I, c = ∃c

([[r]] I[c

] ∧ [[β]] (c, c

))

[[⊥]] (I, c) = true

[[r]] s = true iff regular expression r

matches string s

(d)

Figure 5: (a) Syntax and (b) types of a FLARE program ρ. We refer to ∗ as a Kleene star. (c) For notational convenience, we use

the shorthand syntax on the left to denote the expressions on the right. (d) Semantics of a FLARE program ρ on spreadsheet I.

x, y, and z are Coordinates. c is a Cell. C is a set of Cells. a and b are integers, and v is a Boolean.

and cell constraints. Anchor constraints serve as a Boolean

predicate but do not extract a column in the output tuple.

A spatial constraint Spatial(v, h) consists of a pair of

constraints: a vertical constraint v, and a horizontal con-

straint h. Both forms are Boolean constraints over a pair of

cells and check whether the two cells have the speciﬁed ver-

tical and horizontal orientation, respectively. Both forms are

parameterized by a quantity argument q. Kleene (“*”) quan-

tities can match any cell in a particular direction, whereas

constant quantities only match cells in a particular location.

The select constraint γ consists of a pair of enumerated

Type tags that control the behavior of Kleene quantities.

A tag Type is either All, NearX, NearY, FarX, or FarY,

each of which is a ﬁlter over a set of cells C with respect to

another cell c. In particular, the tag NearX selects those cells

from C that have the shortest horizontal distance from the

cell c. Other tags are deﬁned similarly. A select constraint

will typically contain at most one constraint in each direction

(i.e., at most one of NearX and FarX, and at most one of

NearY and FarY).

3.1 Program Execution

In this section, we walk through the execution of a FLARE

program and discuss our early experience writing programs

for extracting data from spreadsheets. Because FLARE

builds on regular expressions, the additional concepts we

present are relatively easy to learn.

For our example, we return to the input spreadsheet

shown in Fig. 1a and the desired relational table output

FLASHRELATE 6 2014/4/23

Node(1,“^[a-zA-Z ]+$”,⊥)

Node(2,“^[0-9]+$”,Anchor(“value”,Vert( ∗ ),>))

Node(3,“^19[0-9]2$”,⊥)

Node(4,“^(FRA [1-3]|NC|)$”,

Anchor(“Comments”,Vert( ∗ ),>))

Edge(1,2,Vert(0),Horiz( ∗ ),Select(All, All))

Edge(2,3,Vert(0),Horiz(1),Select(All, All))

Edge(3,4,Vert(0),Horiz( ∗ ),Select(All, All))

(a) Shown as a FLARE program.

3 2 1

* * *

(b) Shown as a tree.

Figure 6: The program for the running example.

shown in Fig. 1b. A FLARE program that accomplishes this

is shown in Fig. 6a and is illustrated schematically in Fig. 6b.

The output of a FLARE program should be thought of

as the simultaneous solution of its set of constraints on

a spreadsheet, however an operational description of the

interpreter helps to further understanding.

FLARE programs must be tree-shaped and all nodes must

be reachable via directed links from the root. This formal-

ism makes the recursive execution procedure described ear-

lier well deﬁned. Every node except the root node is paired

with one edge for which that node is the destination. Each

node corresponds a constraint over the contents of cells in

a column of the desired relational table. Each edge corre-

sponds to a spatial relationship between the cells of two out-

put columns. The interpreter loop processes each of these

constraint pairs starting from the tree root’s cell constraint.

When evaluating the root constraint, all cells in the spread-

sheet are considered. Cells ﬁltered by the root constraint are

candidates for inclusion in the root column (which is not

necessarily the ﬁrst column). Note that inclusion of a cell

in the ﬁnal output is predicated on all subsequent constraints

for each cell being met.

For subsequent nodes and edges, the interpreter recur-

sively returns a new set of cells satisfying each additional

constraint pair. The result become candidates for the at-

tribute named by the constraint pair’s ID.

In our example, the following root node is ﬁrst evaluated.

Node(1,“^[a-zA-Z ]+$”,⊥)

The cell constraint ﬁlters the entire input spreadsheet by

applying the regular expression, ^[a-zA-Z ]+$, and then

by ensuring that all cells that match the expression also

have neighbors that satisfy the anchor expression, ⊥. The ⊥

anchor is a special case meaning that no anchor expression

is given, thus evaluating the ⊥ anchor for any cell in any

spreadsheet always returns true. After evaluating the root

A B C D E . . . R

1 value year value year Comments

2 Albania 1,000 1950 930 1981 FRA 1

3 Austria 3,139 1951 3,177 1955 FRA 3

4 Belgium 541 1947 601 1950

5 Bulgaria 2,964 1947 3,259 1958 FRA 1

6 Czech . . . 2,416 1950 2,503 1960 NC

. . .

Figure 7: The set of cells (highlighted) from I that satisfy

the root constraint in the running example.

node, the output column with attribute ID 1 contains the

cells highlighted in Fig. 7. Our intent was to select the cells

belonging to the ﬁrst column, i.e., country names. Clearly,

this constraint is insufﬁciently precise; however, additional

constraints will eliminate these unwanted cells.

The cell constraint for column 2 includes an anchor in

addition to its regular expression:

Node(2,“^[0-9]+$”,Anchor(“value”,Vert( ∗ ),>))

First, cells that do not match “^[0-9]+$” are eliminated

(note that FLARE matches regular expressions against the

raw text values in a spreadsheet before Excel’s formatting

rules are applied). Next, only the cells that satisfy the an-

chor predicate are returned. This anchor states that a cell

containing the text “value” must appear somewhere above

the match.

The following edge joins columns 1 and 2:

Edge(1,2,Vert(0),Horiz( ∗ ),All)

The spatial constraint states that cells identiﬁed by the

cell constraint for attribute 2 must be anywhere to the right

and not above or below the cells identiﬁed by the cell con-

straint for attribute 1. This eliminates the unwanted cells for

attribute 1 described earlier.

The interpreter processes all remaining constraints in the

same manner. Every parent cell in the program tree is joined

with each of its child cells, thus yielding a relational n-tuple.

This program yields the output shown in Fig. 1b.

The worst-case complexity of a FLARE program, in terms

of the number of cells, n, and number of columns, c, is

O(n

), when all edges use Kleene stars. In practice, com-

plexity is low. Tuples with very large numbers of columns

are rare. Encodings rarely require only variable-length com-

ponents. Lastly, the FLARE interpreter can search all alter-

natives efﬁciently in parallel, since extraction operations can

be expressed as pure functions on an immutable grid. As

a point of reference, the FLARE synthesizer may call the

FLARE runtime many thousands of times during its opera-

tion. As we show in §5 , FLASHRELATE is nonetheless quite

fast.

4. FlashRelate Synthesis Algorithm

The goal of the FLASHRELATE algorithm, shown in

Fig. 8a-e, is to generate a FLARE program that is consis-

tent with the positive and negative examples supplied by

FLASHRELATE 7 2014/4/23

SYNTH(I, P, N)

1 for each column index i

2 AN [i] = LearnN (I, P, i)

3 for each pair of column indices i, j such that i 6= j

4 AE[i, j] = LearnE(I, P, i, j)

5 return SEARCH(∅, N, AN , AE)

(a)

SEARCH(C

, N, AN , AE)

1 if |C

| = NUMCOLS

2 if

(α,β)∈C

Negate(α) ∪ Negate(β) = N

3 return C

4 else return FAILURE

5 else

6 C

= {E | E ∈ C

}

7 pairs = {(N , E) | N ∈ AN [i], E ∈ AE[i, j]

s.t. G

= (V, C

∪ {E}) is loop-free

and i, j ∈ [1 . . . NUMCOLS]}

8 pairs

= RANKP(pairs)

9 k = 0

10 while k < |pairs

11 C

= SEARCH(C

∪ {pairs

[k]}, N)

12 if C

= FAILURE

13 k = k + 1

14 else

15 return C

16 return FAILURE

(b)

LEARNN (I, P, i)

1 A = set of predeﬁned constraints

2 AN = ∅

3 for each constraint α ∈ A

4 if ∀ tuples p ∈ P, [[α]] (I, p[i]) = true

5 AN = AN ∪ {Node(i, α)}

6 return AN

(c)

LEARNE(I, P, i, j)

1 AE = ∅

2 V = LEARNQUANT(I, P, i, j, TRUE)

3 H = LEARNQUANT(I, P, i, j, FALSE)

4 for each {v, h | v ∈ V, h ∈ H}

5 β = Spatial(v,h)

6 for each γ ∈ ENUMSELECT

7 AE = AE ∪ {Edge(i,j,β,γ)}

8 return AE

(d)

LEARNQUANT(I, P, i, j, vert)

1 ∆ = ∅

2 Q = ∅

3 λ = Vert if vert otherwise Horiz

4 φ = snd if vert otherwise fst

5 for each p ∈ P

6 ∆ = ∆ ∪ {λ(φ(p[j]) − φ(p[i]))}

7 if ∀δ

, δ

∈ ∆, δ

= δ

8 Q = any δ ∈ ∆

9 if ∀δ ∈ ∆, δ ≥ 0

10 Q = Q ∪ {λ(∗)}

11 else if ∀δ ∈ ∆, δ ≤ 0

12 Q = Q ∪ {λ(−∗)}

13 else return FAILURE

14 return Q

(e)

Figure 8: FLASHRELATE’s program synthesis procedures. (a) is the top-level procedure. (b) is the program search procedure.

(c) is a subroutine for learning node constraints from positive examples. (d) is a subroutine for learning edge constraints from

positive examples. ENUMSELECT is a macro that enumerates the set of all possible select constraints. (e) learns the direction

and amount of spacing between examples.

the user. A FLARE program is a set of cell and spatial con-

straints, which as we described earlier, represent vertices

and edges in a constraint graph. We thus frame the prob-

lem of ﬁnding a correct program as a search over all valid

sets of these constraints that satisfy the positive and negative

examples.

While FLARE can produce programs that represent arbi-

trary disconnected directed acyclic graphs (DAGs), we limit

the FLASHRELATE’s capabilities to producing connected di-

rected trees. We do this for two reasons: 1) all of the encod-

ings that we have examined thus far can be encoded in the

tree-shaped subset of FLARE, and 2) this formulation allows

us to reduce the problem of ﬁnding a satisfying program to

that of ﬁnding a spanning tree. Thus FLASHRELATE’s pro-

gram search procedure, shown in Fig. 8b, is modeled on a

recursive formulation of Kruskal’s spanning tree algorithm.

We also employ a number of heuristics in FLASHRE-

LATE. Note that there may be many constraint graphs that

satisfy the examples given by the user. While all of these

programs are correct with respect to the user’s examples,

not all of them are what the user wants. Inferring user de-

sires from incomplete speciﬁcations is a difﬁcult problem.

Our heuristics are thus employed to accomplish two goals:

1) they guide the search toward constraints more likely to be

created by users, and 2) speed the search by

4.1 Deﬁnitions

We use the following terms in Fig. 8. Let P be a set of user-

provided tuples representing desired program outputs, from

FLASHRELATE 8 2014/4/23

this point on referred to as positive examples. Let N be a

set of user-provided tuples representing undesired program

outputs, from this point on referred to as negative examples.

NUMCOLS is deﬁned as the number of attributes in a tuple

in P .

Let V be the set of attribute IDs, where each ID represents

a position in the desired relational output tuple in P . Let E

be the complete set of directed edges, each edge referred

to by a pair of attribute IDs (i, j). Let G = (V, E) be the

complete digraph over the attributes in P . The following

two expressions also signify a shorthand for the negative

tuples excluded by a cell constraint and spatial constraint,

respectively:

Negate(α) ≡ {n ∈ N | n /∈ [[α]] (I, c), ∀c ∈ I}

Negate(β) ≡ {n ∈ N | n /∈ [[β]] (c, c

), ∀c, c

∈ I, c 6= c

}

4.2 Algorithm

Informally, the synthesizer must perform the following

tasks, given P and N:

1. Using positive examples, determine which cell con-

straints are possible (Fig. 8c). See §4.2.1 .

2. Using positive examples, determine which spatial con-

straints are possible (Fig. 8d). See §4.2.2 .

3. Identify a combination of cell and spatial constraints that

excludes all negative examples (Fig. 8b). See §4.2.3 .

We discuss a number of implementation details that make

FLASHRELATE’s search procdure efﬁcient in §4.2.4 .

Example We discuss a single round of synthesis using the

algorithm shown in Fig. 8 by way of a running example, the

spreadsheet shown in Fig. 9a. The desired relational output

is shown in Fig. 9b.

FLASHRELATE is intended to be used in the following

interactive manner:

1. The user calls FLASHRELATE with a sample tuple (a

positive example) from the desired relation over data in

the spreadsheet. FLASHRELATE returns a program to the

user.

2. If the program extracts the relational table that the user

wanted, the user is done. Otherwise, the user points out a

discrepancy between the extracted table and the intended

table with one of the following actions:

•

If the extracted table is missing a tuple, the user pro-

vides a new positive example and calls FLASHRE-

LATE again (step 2).

•

If the extracted table contains an undesired tuple,

the user provides a new negative example and calls

FLASHRELATE again (step 2).

Note that while users must provide at least one new posi-

tive or negative example during each round of the procedure

described above for the algorithm to synthesize an improved

FLARE program, they are not limited to providing only a sin-

gle new example.

Suppose our user starts the process by providing the fol-

lowing positive example to the synthesizer, the ﬁrst tuple in

the desired table:

Deerﬁeld 130 Central St. Joe M.

Note that this positive example also encodes the follow-

ing information, a map from each tuple attribute index to

spreadsheet coordinates.

1 −→ (1, 2) 2 −→ (2, 2) 3 −→ (3, 2)

FLASHRELATE uses both representations. The ﬁrst rep-

resentation describes the contents of a tuple; the second rep-

resentation describes the spatial relationships between at-

tributes in a tuple.

4.2.1 Step 1: Determine cell constraints

The FLASHRELATE synthesizer constructs cell constraints

from regular expressions. A regular expression comes from

one of two places: 1) it is dynamically constructed from a

small set of standard character class tokens (EmptyCellTok,

WhiteSpaceTok, AlphaTok, NumTok, and PunctTok), or

2) it comes from a small collection of commonly occurring

string patterns that we identiﬁed while studying spreadsheets

in the EUSES corpus. This strategy has been used by oth-

ers [13] in research for learning string programs.

Regular expression learning algorithms are outside the

scope of this paper, but the topic is well-studied [2]. Thus,

the primary focus in this paper is in learning geometric

patterns (see §4.2.2 ). FLASHRELATE is designed to work

with any learning procedure that can learn from a set of

positive string examples.

Since cell constraints may also include anchor con-

straints, we use the following simple anchor synthesis scheme.

Given a set of positive example cells for attribute i, the

anchor synthesizer searches each cell’s neighbors for a

similarly-located common string. If such a common neigh-

bor is discovered, an Anchor is added as a possibility for the

appropriate cell constraint. In practice, we ﬁnd that empty

cells are frequently used as anchors by the synthesizer.

Line 1 in Fig. 8a calls LearnN (Fig. 8c) for each the set

of strings corresponding to each attribute ID i in the set of

positive examples. LearnN eliminates those constraints that

do not match all of the strings associated with i.

Returning to our example, if the synthesizer is given the

set of regular expressions shown in Fig. 10, LearnN returns

the constraints shown in Fig. 11.

4.2.2 Step 2: Determine spatial constraints

Line 3 in Fig. 8a calls LearnE (Fig. 8d) for each pair of

attribute indices (i, j) in P . LearnE ﬁnds all possible spatial

constraints that satisfy the observed spatial layout between

columns from positive examples.

For each positive example p, LearnE calculates the delta

for the cells (lines 2-3) p[i] and p[j]. Each delta is the dis-

tance in either the horizontal or vertical direction, expressed

as a number of cells, between p[i] and p[j].

FLASHRELATE 9 2014/4/23

1 2 3

1 Town Address Owner

2 Deerﬁeld 130 Central St. Joe M.

3 10 Whately Ave. Mary L.

4 TOTAL 2

5 Amherst 10 North Pleasant St. Bob O.

6 55 Westman Lane Claudia S.

7 2 Rectangle Ct. Alyssa B.

8 TOTAL 3

9 Hadley 5 Rocky Rd. Greg S.

10 34 Godell Rd. Omar L.

11 TOTAL 2

1 2 3

1 Deerﬁeld 130 Central St. Joe M.

2 Deerﬁeld 10 Whately Ave. Mary L.

3 Amherst 10 North Pleasant St. Bob O.

4 Amherst 55 Westman Lane Claudia S.

5 Amherst 2 Rectangle Ct. Alyssa B.

6 Hadley 5 Rocky Rd. Greg S.

7 Hadley 34 Godell Rd. Omar L.

Figure 9: (a) An input spreadsheet. (b) The desired relational table. Note that missing ﬁelds have been inserted and headers and

summary rows have been removed.

^[0-9]+$ ^[a-z]+$

^[A-Z]+$ ^[A-Z][a-z]+$

^[0-9]+ [a-zA-Z. ]+$ ^[a-zA-Z. ]+$

Figure 10: The example uses the above set of regular expres-

sions.

{ Node(1, ^[A-Z][a-z]+$,⊥) }

{ Node(2, ^[0-9] [a-zA-Z. ]+$,⊥) }

{ Node(3, ^[a-zA-Z. ]+$,⊥) }

Figure 11: Candidate cell constraints after calling LearnN .

Edge(1, 2, Vert(0), Horiz(1)), Select(All, All))

Edge(1, 2, Vert(0), Horiz( ∗ )), Select(All, x))

Edge(1, 2, Vert( ∗ ), Horiz(1)), Select(y, All))

Edge(1, 2, Vert( ∗ ), Horiz( ∗ )), Select(y, x))

. . .

Figure 12: Candidate Edge constraints from the run-

ning example. Only the constraints for attribute pair

(1, 2) are shown, x ∈ {NearX, FarX, All}, and y ∈

{NearY, FarY, All} for space reasons.

Node(1, ^[A-Z][a-z]+$,⊥)

Node(2, ^[0-9] [a-zA-Z. ]+$,⊥)

Node(3, ^[a-zA-Z. ]+$,⊥)

Edge(1, 2, Vert(0), Horiz(1)), Select(All, All))

Edge(2, 3, Vert(0), Horiz(1)), Select(All, All))

Figure 13: The program synthesized given a single positive

example.

Deerﬁeld 130 Central St. Joe M.

Amherst 10 North Pleasant St. Bob O.

Hadley 5 Rocky Rd. Greg S.

Figure 14: Program output after running the synthesis algo-

rithm with a single positive example.

Given the set of deltas derived from all the positive ex-

amples for i and j, the set falls into one of more of the fol-

lowing classes. We use delta class information when calling

Edge constructors to generate spatial constraints.

1. All the deltas represent a ﬁxed distance, and the set of

deltas is a bijective relation from p[i] to p[j]. This class

produces constant-length quantity values for Vert() or

Horiz() constructors.

2. Delta distances vary, and the set of deltas is a bijective re-

lation from p[i] to p[j]. This class produces Kleene (“*”)

quantities for Vert() and Horiz() constructors and

“match-single” select type tags (NearX, FarX; NearY,

FarY).

3. Delta distances vary, and the set of deltas is not a bijec-

tive relation from p[i] to p[j]. Speciﬁcally, one or more

distinct p[i] maps to two or more p[j]. This class pro-

duces Kleene (“*”) quantities for Vert() and Horiz()

constructor and “match-all” select type tag (All).

We also use delta class information when pruning for

efﬁciency reasons (see 4.2.4, “Pruning”). The set of possible

spatial constraints inferred for ID pair (1, 2) for our single

positive example is shown in Fig. 12.

4.2.3 Step 3: Find a satisfying set of constraints

Next FLASHRELATE searches for a set of constraints that

satisfy its negative examples (Fig. 8a). As shown in Fig. 8b,

there are ﬁve essential steps in this recursive procedure:

1. Exclude (Node, Edge) constraint pairs that would intro-

duce a loop into the program graph given the current set

of chosen constraints (line 7).

2. Call RankP to rank constraint pairs (line 8). We discuss

RankP in §4.2.4 .

3. Choose a constraint pair (line 10).

4. Recursively choose the next pair of constraints (line 11).

5. If constraints have been found for all the attributes in

the relation, ensure that the program excludes all of the

negative examples (line 2) and return the program. If not,

backtrack (line 4).

We deﬁne here some additional terms used in Fig. 8.

Lines 6-16 represent the algorithm’s implementation of non-

deterministic choice. Each iteration of the while loop rep-

resents a choice point. C

denotes the a set of chosen edges

at a particular choice point. C

denotes the set of chosen

constraint pairs (α, β) at the same choice point where α is

FLASHRELATE 10 2014/4/23

a Node(j, r, ) constraint and β is an Edge(j, k, v, h, γ)

constraint. pairs denotes the set of all valid constraint pairs

at the same choice point.

The search space may contain numerous solutions. The

algorithm is free to choose any correct program that excludes

all of the negative examples. The implementation of RankP

determines which pair of constraints the search ultimately

chooses. We discuss these implementation choices in §4.2.4

In our example, suppose the synthesizer chooses the pro-

gram shown in Fig.13. Since, in our example, the user has

provided no negative examples, the chosen program trivially

satisﬁes the criteria on line 2 in Fig. 8. Were this not the

case, the search would backtrack and consider a different

constraint pair.

The output of this program is shown in Fig. 14. While all

of the extracted tuples are correct, the output is missing rows

with omitted towns in the original spreadsheet (Fig. 9a). The

user’s next step is to add another positive example (e.g., the

tuple in Fig. 9b) and call the synthesis algorithm again

(Fig. 8a). This procedure continues until the user is satisﬁed

with the output of the algorithm. We omit these additional

user-interactions for brevity.

4.2.4 Implementation Details

The efﬁcient operation of the synthesis algorithm depends

on three key implementation ideas:

1. Pruning of the search space, for efﬁcient search.

2. Ranking schemes for synthesizer search choices.

3. Choice of data structures.

Pruning Constant-length spatial constraints can be ruled

out when the geometric form has been shown to vary. In our

example in Fig. 9, all Edge constraints containing Vert(0)

for (1, 2) can be ruled out because the cell in the second

column in the second tuple (“10 Whately Ave.”) is not a

constant-length distance vertically from the ﬁrst cell (“Deer-

ﬁeld”). In the ﬁrst tuple, the vertical distance is zero; in the

second the distance is 1.

Match-single spatial constraints can also be ruled out in

some cases. In the same example mentioned above, it is

not possible for the Edge constraint for (1, 2) to contain a

Select() with the tags NearY or FarY. This is because it is

possible to match two different cells (“130 Central St.” and

“10 Whately Ave”) for attribute 2 from the starting cell for

attribute 1 (“Deerﬁeld”).

Ranking FLASHRELATE often has a number of alterna-

tive constraints to choose. The balance of speciﬁc vs general

constraints is important because it impacts both the speed

of the synthesizer and the number of examples required by

the user. Favoring speciﬁc constraints may make the search

fast, because they are more likely to rule out negative exam-

ples. However, speciﬁc constraints may also mean that the

user must provide more positive examples before the correct

program is found. Conversely, favoring general constraints

may fail to exclude negative examples, causing the search

to backtrack frequently, resulting in a slow search. General

constraints may require a user to provide more negative ex-

amples.

We tend to favor more speciﬁc programs over more gen-

eral programs. This enables users to focus on what they

want, instead of what they don’t want, which we believe to

be more natural. This design choice also allows for faster

search. We rank constraint pairs by the following schemes,

in this order:

1. R1 Constraints that exclude large numbers of negative

examples are favored over constraints that exclude few.

This scheme ensures that the search favors constraints

that exclude negative examples.

2. R2 Speciﬁc spatial constraints are favored over general

spatial constraints. This scheme implies that multiple

positive examples are required to learn non-constant-

length spatial constraints.

3. R3 Straighter programs are favored over ones with more

bends. This simpliﬁes programs.

4. R4 We consider the cells above and to the left of positive

examples to be implicit negative examples.

Data Structures The effect of a constraint on a spreadsheet

is a set of cells. We represent node matches as a bit vector

(using .NET’s BigInteger struct). Likewise, we represent

edge matches as a mapping from cell coordinates to bit vec-

tors. In both cases, the length of the bit vector corresponds

to the number of cells in the spreadsheet, and the coordinate

of a given index in the bit vector is determined by a map-

ping function, f. In the bit vector, a “1” at index i means

that the constraint matched a cell at coordinate f(i) in the

spreadsheet; a “0” indicates no match.

This representation is efﬁcient for three reasons: 1)

Matches are stored efﬁciently. For Node constraints, only

one bit per cell is required. 2) Questions such as “does this

constraint satisfy all positive examples?” can be answered in

constant time using bit vector arithmetic. For this purpose,

we bitwise-AND the effect of a constraint with the bit vec-

tor representing the positive examples. If the resulting bit

vector equals the positive example bit vector, then answer

is yes, otherwise no. 3) Operations requiring bit-counting

can be done in O(# bits set) using Wegner’s bit-counting

algorithm [27]. Bit-counting is frequently used in our rank-

ing schemes to count the number cells matching a particular

criteria.

We are able to evaluate partial programs efﬁciently during

search using a dynamic programming scheme by caching

the effect of each constraint on the input spreadsheet. Lastly,

partial program evaluation can largely happen in parallel and

before the synthesizer runs (i.e., ofﬂine). With a large set of

constraints, this is especially beneﬁcial, as the system can

process constraints in the background before users invoke

the program synthesizer.

FLASHRELATE 11 2014/4/23

5. Evaluation

In this section, we evaluate the design of FLARE and

FLASHRELATE on a variety of real-world spreadsheets. The

purpose of this evaluation is to answer the following ques-

tions:

1. It is possible to manually write FLARE programs to per-

form a diverse set of extraction tasks?

2. Can FLASHRELATE automatically infer equivalent pro-

grams for the same set of tasks as in question 1?

3. How effective are our heuristics at reducing the execu-

tion time and number of examples required by the syn-

thesizer?

5.1 Benchmark Spreadsheets and Tasks

To evaluate FLASHRELATE, we assembled a collection of 43

benchmarks using spreadsheets taken from other work on

reorganizing spreadsheet tables, from our own microbench-

marks for testing purposes, and from a large spreadsheet cor-

pus created for research purposes.

Benchmark Selection Our evaluation considers two sets

of benchmarks

. The ﬁrst set of benchmarks were borrowed

from prior work [16] that examined 51 table-transformation

programs from Excel user-help forums. Despite the appar-

ent complexity of these tasks, we found to our surprise

that nearly half (22) of the transformations were straight-

forward relational extraction tasks in FLARE. To round out

our evaluation with more difﬁcult tasks, we assembled a sec-

ond set of benchmarks by searching the EUSES spreadsheet

corpus [11] for spreadsheets with complex ad-hoc encod-

ings. This second set of benchmarks was chosen speciﬁcally

to test our synthesis algorithm against challenging extrac-

tion tasks. We also supplemented this second set with syn-

thetic benchmarks known to present challenges to our syn-

thesizer. As a measure of the complexity of the extractoin

task for each benchmark, we note the number of variable-

length Edge constraints that appear in our ground truth pro-

grams (“k*” in 16).

Expressiveness To evaluate the expressiveness of FLARE,

we manually wrote a correct Flare program for each bench-

mark and extracted the resulting output table. In the course

of this effort, we found that the FLARE language is expres-

sive enough to extract the desired tuple from all of multi-

dimensional data patterns we observed. We conclude that,

independent of our ability to synthesize a FLARE program,

FLARE is an effective tool for expressing data extraction

tasks against semi-structured data projected into a two-

dimensional grid.

Synthesizer Experiments Using our ground truth pro-

grams, we compare the results of the FLASHRELATE syn-

thesis algorithm to evaluate its effectiveness. Recall that syn-

thesis depends on providing a set of positive and negative

examples to the synthesis algorithm as described in §4 .

All benchmarks will be made available online.

The following method is intended to simulate a user in-

teracting with FLASHRELATE. The relational table extracted

by the ground truth query represents the user’s desired ex-

traction output; we call this the oracle. After each invocation

of the FLASHRELATE algorithm by our simulated user with

a set of examples, we determine whether the synthesized

program is correct by comparing its output against the or-

acle. When the FLASHRELATE output differs, the user ﬁnds

the ﬁrst tuple that deviates from the oracle by scanning the

extracted table from top to bottom. Deviant tuples come in

two forms: 1) if a tuple from the oracle is missing from the

program output, it is a positive example; 2) if a tuple from

the program output does not appear in the oracle, it is a neg-

ative example. We repeat this process until the synthesizer

either ﬁnds a program whose output matches the oracle or

we time-out. 10 minutes was chosen as the maximum total

duration of the task as we felt that typical users longer wait

times intolerable.

In our experiments, we consider 5 algorithm conﬁgura-

tions to understand the beneﬁt of our ranking choices. In all

cases, regular expressions come from a small corpus (< 100)

of common patterns combined with the regular expression

generator described in § 4. Each conﬁguration examines the

effect of adding a ranking scheme to the FLASHRELATE al-

gorithm. Experiment conﬁgurations are: 1) all rankings, 2)

R1, R2, and R3, 3) R1 and R2, 4) R1, and 5) no rank-

ing. For each ranking scheme, R# refers to the procedure

described in §4.2.4 (Ranking).

5.2 Results

Fig. 15a shows the total time it takes to synthesize our

43 benchmarks. The y-axis shows the running time of the

synthesis algorithm in cases where it succeeded. The axis

is truncated at 60 seconds since, with ranking, most bench-

marks succeed well before that time.

In the best case (all rankings), the algorithm failed to ﬁnd

a solution within 10 minutes for only 1 out of 43 bench-

marks. When synthesis found a correct solution, more than

80% of the benchmarks completed in less than 10 seconds

total. Per-iteration time is extremely fast: typically a user

only has to wait 1.6 seconds (median: 0.6 seconds).

Fig. 15b shows the number of examples (iterations) re-

quired to synthesize the correct result. Users only need to

provide an average of 3.5 positive examples (median: 3 pos-

itive examples) and 2.0 negative examples (median: 1 neg-

ative example). Without ranking, the algorithm is signiﬁ-

cantly slower, and fails to ﬁnd a solution more often before

a timeout occurs (for 13 total timeouts). Without ranking,

the algorithm also requires many more examples: an aver-

age of 3.0 positive and 13.7 negative examples (medians:

3 positive; 9 negative). We conclude that without ranking,

sometimes very general solutions are generated (matching

too many cells) and thus numerous negative examples are

required to sufﬁciently narrow the selection.

FLASHRELATE 12 2014/4/23

10"

20"

30"

40"

50"

60"

0" 10" 20" 30" 40" 50" 60" 70" 80" 90" 100"

#"of"Seconds"

%"benchmarks"synthesized"in"≤""Y"total"seconds"

all"

R1,R2,R3"

R1,R2"

R1"

none"

10"

15"

20"

25"

30"

35"

40"

0" 10" 20" 30" 40" 50" 60" 70" 80" 90" 100"

#"of"Examples"

%"of"benchmarks"requiring"≤"Y"examples"

all"

R1,R2,R3"

R1,R2"

R1"

none"

(a) (b)

Figure 15: (a) Total benchmark synthesis times with ranking and without ranking schemes. Lower numbers are better. For

example, none (no heuristics) succeeded on roughly 60% of the benchmarks before the timeout whereas all succeeded on all

but one. (b) Number of examples required to successfully synthesize a program with and without ranking schemes. Lower

numbers are better. Generally, all required fewer examples than the other heuristics.

pre worst pos # # # # pre worst pos # # # #

(sec) (sec) + neg k* col rec cells (sec) (sec) + neg k* col rec cells

hg ex2 0.1 5.7 3+2 0 9 14 270 h8d62ck 0.2 2.7 5+5 2 8 62 864

hg ex3 0.0 0.2 2+2 0 2 15 168 02rise 0.0 5.0 4+1 0 12 6 175

hg ex4 0.0 0.2 2+1 1 2 6 9 03-1-report 0.1 0.4 6+1 4 4 177 702

hg ex5 0.0 0.2 2+1 0 2 92 128 03PFMJOUR 0.1 2.0 3+2 1 7 10 288

hg ex6 0.0 0.3 5+2 2 3 9 68 1213 0.1 1.0 3+2 4 4 116 350

hg ex7 0.0 19.9 3+1 0 17 3 171 1214 1.7 0.9 5+2 3 3 62 162

hg ex8 0.0 1.1 5+4 3 5 250 240 2003FinalPop 0.5 0.2 3+2 1 2 191 1113

hg ex9 0.0 0.1 1+1 0 1 8 8 3 4 0.2 1.3 3+4 1 5 68 522

hg ex10 0.0 0.2 2+1 1 2 20 30 3q2000 0.9 19.3 4+1 0 19 5 3780

hg ex11 0.0 0.3 3+3 2 3 19 42 3yrsegment 0.1 1.7 3+1 2 5 55 750

hg ex12 0.1 1.9 3+1 0 9 3 252 4q03fax 0.2 0.9 7+3 2 3 186 1394

hg ex13 0.0 0.5 4+2 4 4 5 63 act3 23 0.0 0.7 3+4 3 3 24 175

hg ex17 0.0 0.4 1+1 0 4 16 160 act4 023 0.1 0.6 4+3 2 4 222 312

hg ex18 0.0 2.0 3+2 0 6 3 28 Appen4-5 TO TO NA 3 4 99 1020

hg ex26 0.0 0.4 1+1 0 4 4 16 ascap 2000 0.3 0.8 5+1 0 6 168 252

hg ex29 0.0 4.0 1+1 0 9 3 60 ascap 2001 0.1 0.8 5+1 2 6 39 1037

hg ex35 0.6 6.4 4+1 0 14 36 2786 e-learning 0.2 2.2 8+15 4 6 127 522

hg ex36 0.0 0.8 1+1 0 6 3 36 Favorite 0.0 0.3 3+3 2 3 36 68

hg ex37 0.0 2.0 1+1 0 10 5 55 historical 0.3 1.3 10+1 3 4 251 1911

hg ex39 0.1 0.2 2+2 0 2 49 146 Staircase 0.0 0.3 3+2 2 3 3 9

hg ex40 0.0 8.9 1+1 0 12 1 12 Table 201 0.2 4.9 7+1 4 9 45 540

hg ex41 0.1 3.2 7+1 2 10 171 840

Figure 16: Benchmarks from [16] are shown on the left while benchmarks from [11] are shown on the right. pre (sec) is the

precomputation time; worst (sec) is the duration of the longest iteration; pos+neg is the number of positive and negative

examples required for synthesis; # k* is the number of Kleene stars in the ground truth program, a measure of its complexity;

# col is the number of columns in the output table; # rec is the number of records in the output table; # cells is the number

of cells in the used range of the input spreadsheet. TO indicates that the synthesis algorithm timed-out after 10 minutes. We

abbreviate ﬁle names for compactness.

Table 16 summarizes the time and number of examples

required by the algorithm with all rankings enabled as well

as benchmark complexity.

We ﬁnd that FLASHRELATE is quite successful at syn-

thesizing the appropriate program quickly and with a small

amount of user effort. Furthermore, while one ranking

scheme sacriﬁces a slightly higher speed for fewer itera-

tions (R4), which we consider to be an acceptable tradeoff,

generally our ranking schemes reduce both the number of

examples required and the time the user has to wait.

FLASHRELATE 13 2014/4/23

5.3 Experimental Setup

We evaluated synthesis on typical end-user hardware. Our

test machine was a AMD Phenom II X4 940 quad-core

desktop machine running at 3GHz with 4GB of RAM.

FLASHRELATE was written in a mix of F# and C# as a

Microsoft Excel 2010 VSTO.NET plugin for Windows 8.

5.4 Future Work

Users sometimes use other encodings, like color or type-

face formatting. While FLARE does not currently support

constraints based on these attributes, they are a straight-

forward extension to our work. Combining FLASHRELATE

with other program synthesis techniques would allow end

users to handle ad-hoc in-cell encodings [15].

6. Conclusion

The ﬂexibility of spreadsheets allows users to combine data

deﬁnitions and data views, providing expressiveness at the

expense of utility. The ad-hoc structure of such spreadsheets

makes it challenging to leverage existing tools (for data

analysis, etc.) that require data to be present in a standard

relational format. We present FLARE, the ﬁrst language to

allow users to express relational extraction queries against

semi-structured, two-dimensional spreadsheet data. FLARE

is much like a two-dimensional AWK. We also present

FLASHRELATE, an algorithm for the automatic synthesis of

FLARE programs from user-provided positive and negative

examples. Using EUSES and other spreadsheet benchmarks,

we show that we can often synthesize programs that users

want, making FLARE accessible to non-experts.

A video demonstration of FLASHRELATE is available at:

http://tinyurl.com/oz72quh

References

[1] R. Abraham and M. Erwig. Header and unit inference for

spreadsheets through spatial analyses. In VL/HCC, pages

165–172, 2004.

[2] D. Angluin. Learning regular sets from queries and counterex-

amples. Inf. Comput., 75(2):87–106, 1987.

[3] M. Anselmo, D. Giammarresi, and M. Madonia. Regular ex-

pressions for two-dimensional languages over one-letter al-

phabet. In Developments in Language Theory, Lecture Notes

in Computer Science. 2005.

[4] M. J. Cafarella, A. Halevy, and J. Madhavan. Structured data

on the web. CACM, 54(2):72–79, 2011.

[5] Z. Chen and M. Cafarella. Automatic web spreadsheet dat

extraction. In SSW’13, 2013.

[6] Z. Chen, M. Cafarella, J. Chen, D. Prevo, and J. Zhuang. Sen-

bazuru: a prototype spreadsheet database management sys-

tem. PVLDB, 6(12):1202–1205, 2013.

[7] J. Cunha, J. Saraiva, and J. Visser. From spreadsheets to

relational databases and back. In PEPM 2009, pages 179–

188. ACM, 2009.

[8] A. Das Sarma, A. Parameswaran, H. Garcia-Molina, and

J. Widom. Synthesizing view deﬁnitions from data. In ICDT,

2010.

[9] E. Ferrara, P. De Meo, G. Fiumara, and R. Baumgartner. Web

data extraction, applications and techniques: a survey. arXiv

preprint arXiv:1207.0246, 2012.

[10] K. Fisher and D. Walker. The PADS project: an overview. In

ICDT, 2011.

[11] M. I. Fisher and G. Rothermel. The EUSES Spreadsheet

Corpus: A shared resource for supporting experimentation

with spreadsheet dependability mechanisms. In 1st WEUSE,

pages 47–51, 2005.

[12] D. Giammarresi, F. Venezia, and A. Restivo. Two-dimensional

languages, 1997.

[13] S. Gulwani. Automating string processing in spreadsheets

using input-output examples. In POPL, 2011.

[14] S. Gulwani. Synthesis from examples: Interaction models and

algorithms. In SYNASC, 2012.

[15] S. Gulwani, W. Harris, and R. Singh. Spreadsheet data ma-

nipulation using examples. CACM, Aug 2012.

[16] W. R. Harris and S. Gulwani. Spreadsheet table transforma-

tions from examples. In PLDI, 2011.

[17] J. Kari and C. Moore. Rectangles and squares recognized by

two-dimensional automata. Theory Is Forever, 2004.

[18] H. Lieberman. Your Wish Is My Command: Programming by

Example. Morgan Kaufmann, 2001.

[19] E. Lu, R. Bodik, and B. Hartmann. Quicksilver: Auto-

matic Synthesis of Relational Queries. Technical Report

UCB/EECS-2013-68, UC-Berkeley, May 2013. URL

http://www.eecs.berkeley.edu/Pubs/TechRpts/

2013/EECS-2013-68.html.

[20] O. Matz. Regular expressions and context-free grammars for

picture languages. In STACS, 1997.

[21] E. Oro and M. Ruffolo. Sila: a spatial instance learning

approach for deep webpages. In CIKM 2011, pages 2329–

2332. ACM, 2011.

[22] E. Oro, M. Ruffolo, and S. Staab. Sxpath: extending xpath

towards spatial querying on web documents. PVLDB, 4(2):

129–140, 2010.

[23] M. Pradella, A. Cherubini, and S. C. Reghizzi. A unifying

approach to picture grammars. Information and Computation,

209(9), 2011.

[24] T. Register. Microsoft feeds excel to supercomputer, Nov.

2009. URL http://www.theregister.co.uk/2009/11/

18/sc09_microsoft_excel_hpc/.

[25] R. Singh and S. Gulwani. Learning semantic string transfor-

mations from examples. PVLDB, 5, 2012.

[26] Q. T. Tran, C.-Y. Chan, and S. Parthasarathy. Query by output.

In SIGMOD, 2009.

[27] P. Wegner. A technique for counting ones in a binary com-

puter. Commun. ACM, 3(5):322–, May 1960. ISSN 0001-

0782. . URL http://doi.acm.org/10.1145/367236.

367286.

FLASHRELATE 14 2014/4/23