The Shrdlite project

In this project you will implement a dialogue system for controlling a robot that lives in a virtual block world. The robot can move around objects of different forms, colors and sizes, and it can answer questions about the world and ask for clarification whenever necessary.

The inspiration is Shrdlu, by Terry Winograd 1970. For Winograd it was a PhD project, but now there are tools available that make things easier: grammar formalisms, ontologies, planners, etc.

You will not have time to create a system that can do everything that Shrdlu can, so you only have to implement a subset of Shrdlu’s capabilities.

The original Shrdlu

Here are some overviews about the original Shrdlu system, including a demo video:

Introduction

The goal of the project is to create an interpreter and a planner so that a person can control a robot in a blocks world to move around objects, by giving commands in natural language.

To make the project more interesting, there is a web-based graphical interface from which the user can interact with the blocks world.

The interface is written in TypeScript (which compiles into Javascript), and it can be run in two different modes:

To be able to run the system you need to install Node.js, NPM and TypeScript. Do that. Now.

About TypeScript

TypeScript is a typed superset of Javascript that compiles to plain Javascript. It is open-source and not specific to any browser or operating system. For information about the language, please visit the official site:

Before everything, you need to install Node.js, NPM and TypeScript:

TypeScript is an extension of Javascript to make it more similar to standard object-oriented languages such as Java. Furthermore it is statically typed, to reduce programming errors. Here are some documentation and tutorials on TypeScript:

You also need to set up Git and GitHub within your group.

What is missing

Most of the project is already implemented, such as two different implementations of the blocks world: one that uses SVG/HTML graphics, and another that works in a plain-text console. Furthermore, the the natural language parser is already implemented using the Nearley parsing library.

What is not implemented correctly is the natural language interpreter and the robot planner. What you are given are stubs that return a dummy interpretation and a dummy plan respectively. Your goal is to implement the interpreter and the planner so that the robot behaves as it should.

The following functions are only templates that you have to fill with intelligent code to get the project to work correctly:

Depending on which extension you want to implement, there might be other places in the code that need modification.

The GitHub project

Set up your GitHub project using the instructions here.

Here is how the template system should behave when you install it from scratch. Regardless of what you say, it will always return a dummy plan which moves a random block around a bit and then puts it back where it started:

The completed project

When you have completed the project you should have a system that works like this:

Select one of the example sentences and press ‘enter’, such as this one (for the “small” world):

The system parses the sentence, and tries to infer a sequence of simple actions that accomplishes the task.

Overview

The user interface is an interactive webpage that reads input from the user, does some AI stuff and presents the results in the browser. There is also a text-based version that is invoked from the command line, using Node.js. The AI stuff is done on the client in a Javascript program, which is compiled from TypeScript.

Compiling TypeScript into Javascript

There are several ways to compile TypeScript, depending on how strict you want the typechecker to be, and on what kind of Javascript module system you want to compile into.

The Shrdlite system is designed to be used with the UMD module system, so please always use --module umd when you compile your TypeScript files:

tsc --module umd FILE.ts

Furthermore, we advise you to use the strictest possible version of the typechecker, which gives this long commandline:

tsc --module umd --alwaysStrict --noFallthroughCasesInSwitch --noImplicitReturns --noImplicitAny --noImplicitThis FILE.ts

We suggest that you store this in a script or as an alias. Or that you use the GNU Make system.

Using GNU Make

There is a Makefile if you want to use the GNU Make system. Here’s what it can do:

Running the system

To start the browser-based system, you first compile shrdlite-html.ts and then you open shrdlite.html in your favourite browser.

Note that it is very difficult to test and debug a system in a browser, so therefore it is probably better to use the command-line interface while developing the system. To do this, you compile shrdlite-offline.ts and run it using Node.js:

node shrdlite-offline.js WORLD EXAMPLE

where WORLD is one of the possible example worlds (“small”, “medium” or “complex”), and EXAMPLE is either a number (denoting one of the example sentences) or an utterance (within quotes).

Data structures

You probably want to use some kind of collection datatype (such as a set, a heap and/or a priority queue), so the de facto-standard library typescript-collections is already included in the repository (in the directory lib/typescript-collections).

To use a datatype from the collections library, you import it like this:

import PriorityQueue from "./lib/typescript-collections/src/lib/PriorityQueue";

Using JavaScript modules in TypeScript

If you want to use standard JavaScript libraries in TypeScript, you have to add a TypeScript declaration file for that library. The DefinitelyTyped library contains declaration files for several libraries, such as the following two:

In fact, these two are already included in the lib directory. Read more about declaration files in the TypeScript documentation.

JavaScript chart parser

The parser is generated by Nearley. The grammar is in the file Grammar.ne, and it is compiled into the TypeScript file Grammar.ts. If you plan to make any changes in the grammar, you have to install Nearley.

Updated 18th May: The latest version of Nearley (2.9) has a bug which gives a Typescript error. To be able to compile the grammars anyway, it’s best to downgrade to Nearley version 2.8:

npm install -g nearley@2.8

Overview of the Shrdlite pipeline

The Shrdlite system consists of the following pipeline:

  1. Input:

    • A user utterance is read from the webpage, all words separated by whitespace.
    • Output: a list of strings
  2. Parsing:

    • The utterance is parsed into zero, one or several semantic trees.
    • Note: the grammar and parser are already given to you.
    • Output: a list of parse results
  3. Interpretation:

    • Each parse result is interpreted in the current world.
    • Note: some parses might not have an interpretation, and some parses might have several interpretations.
    • Output: a list of logical interpretations
  4. Ambiguity resolution:

    • If the utterance results in several different interpretations, the ambiguity should be resolved in some way.

      The easiest (and worst) solution is to simply reject the utterance.

      A better solution could be to ask a clarification question, in which case the program must exit by outputting the question, and then resuming when the user has answered.

    • Depending on your preferences, you might want to put the ambiguity resolution after the planner instead.
    • Output: a single interpretation (or failure, or a clarification question)
  5. Planning:

    • The interpretation is solved in the current world, resulting in a plan of basic commands that can take the current world into the goal state.
    • Output: a list of basic commands
  6. Output:

    • The final list of basic commands is performed by the virtual robot.

The world

Shrdlite uses a 2-dimensional blocks world, not 3d. The main reason for this is that it is much easier to visualize, but also because spatial reasoning becomes easier.

The world contains a floor, and a number of objects. The objects can be on top of or inside each other (if the form and size permits). The goal of the “game” is to move around the objects in any way the user likes, a bit like Minecraft or Lego. This can only be done by talking to a robot arm which can pick up and put down objects.

The world contains at least the following forms, colors and sizes:

To simplify things, the floor only has room for at most N objects at the same time. We call each of these places a stack or a column, and number them 0…N-1.

This means that internally, the world can be represented as a list of stacks, where each stack is a list of objects, where each object has a unique identifier. This is not the only possible representation (another is to e.g. use a set of logical statements), but the one we use here. This is how the world state is implemented in World.ts:

export interface WorldState {
    stacks: Stack[][];
    holding: string;
    arm: number;
    objects: { [s:string]: ObjectDefinition; };
    examples: string[];
}

Each stack is a list of strings, where the strings themselves are unique identifiers for the objects in the world. The state also contains information about which object the robot arm is holding (if any), the position of the arm, descriptions of each object, and a list of example utterances.

There are four example worlds implemented in the file ExampleWorlds.ts: “small”, “medium”, “complex” and “impossible”. The impossible world contains examples that break the physical laws in different ways, so it is not supposed to be used. But the other three worlds can be used in both the browser version and the text-based version. You can also add new worlds to the file ExampleWorlds.ts, it should be fairly self-explanatory.

Physical laws

The world is ruled by physical laws that constrain the placement and movement of the objects:

There is an example world in the project template called “Impossible”, which gives examples of bricks that break the physical laws in some way.

Spatial relations

The following spatial relations can be used to describe how objects are to be positioned:

Note that these definitions are not set in stone; if you want you can use different definitions.

The basic commands

A plan is a sequence of basic commands, and there are only four of them:

Parsing

User utterance interpretation comes in two steps. First the utterance has to be parsed. To ease the burden, the grammar and parser is already given to you, using the Javascript library Nearley:

The parse result is a Javascript object reflecting the syntactic structure of the input utterance. The object doesn’t contain all details of the specific grammar, but instead only the important semantic parts.

Ambiguities

Note that the grammar is ambiguous, meaning that it can return several parse results for a single input utterance. This is on purpose, since natural language often is ambiguous. Here is an example utterance that returns two trees:

The two trees correspond to the meaning of the following two unambiguous utterances:

The parser module returns all possibilities with the hope that the interpretation module can disambiguate the intended meaning in the current world. For example, if there is no ball inside any box, then the first meaning can be filtered out.

Overview of the grammar

Here is a quick overview of how the grammar works. All utterances start with the nonterminal Command.

Detailed description of the phrase-structure rules

The grammar is specified in a special grammar file which is compiled into Javascript by the Nearley library. Here is a quick explanation of the Shrdlite grammar, as it is defined in the file Grammar.ne. All types and classes are defined in the file Types.ts.

You don’t have to make any changes to the grammar or parser (unless you want to augment the grammar with new commands or linguistic constructions).

Command

Either the robot can pick up something, drop the thing it’s holding, or move something to a new location:

command --> ("take"|"grasp"|"pick up")  entity
         |  ("move"|"put"|"drop")   "it"   location
         |  ("move"|"put"|"drop")  entity  location

The parser returns a TypeScript object conforming in one of the Command classes:

export type Command = TakeCommand | DropCommand | MoveCommand;
export class TakeCommand {
    constructor(public entity:Entity)
}
export class DropCommand {
    constructor(public location:Location)
}
export class MoveCommand {
    constructor(public entity:Entity, public location:Location)
}

Using phrase-structure grammar terms, the command is an imperative phrase (which in English is the same as a verb phrase). The verb “take” is a transitive verb, taking a noun phrase as argument, while “move” is a ditransitive verb, taking a noun phrase and an adverbial as arguments.

Entity

An entity consists of a quantifier followed by an object description:

entity  -->  quantifier[Number]  object[Number]

The arguments have to be both singular or both plural, i.e., they have to agree in grammatical number. The parser returns an object in the class Entity:

export class Entity {
    constructor(public quantifier:string, public object:Object)
}

In phrase-structure grammar, the entity corresponds to a noun phrase, which in turn is constructed by a determiner followed by a complex noun.

Location

A location is a relation followed by an entity:

location  -->  relation  entity

The parser returns an object in the class Location:

export class Location {
    constructor(public relation:string, public entity:Entity)
}

Using phrase-structure grammar terms, the location is a prepositional phrase which in turn is a kind of adverbial. The location consists of a relation, which is a preposition, and an entity (which we already know is a noun phrase).

Object

An object can either be simple, consisting of an optional size, optional color and a form:

object[Number]  -->  size?  color?  form[Number]

The size and color happen to be adjectives and the form is a noun. Since English adjectives aren’t inflected for number, it’s only the form that can be either singular or plural.

An object can also be complex:

object[Number]  -->  object[Number]  "that is/are"?  location

Note that the number of the complex object depends solely on the number of the immediate object. This is because the meaning of the complex object is that the location “filters out” possibilities that are given by the immediate object.

Also note that the relative pronoun+copula (“that is/are”) is optional, and it is this optionality that makes utterances ambiguous.

Depending on the type of object, the parser returns either a SimpleObject or a RelativeObject:

export type Object = RelativeObject | SimpleObject;
export class RelativeObject {
    constructor(public object:Object, public location:Location)
}
export class SimpleObject {
    constructor(public size:string, public color:string, public form:string)
}

The remaining grammar rules

The rest of the grammar consists mainly of lexical rules. Some of the lexical items are synonyms, such as the relations “under” and “below”:

relation --> "under" | "below"
          |  "beside"
          |  ...

The lexical rules return an atom (i.e., a TypeScript string), and synonyms return the same string (i.e., both “under” and “below” return the atom under as parse result).

Interpretation

The parser can return several parse results, and each of them should be interpreted in the current world. This interpretation will hopefully filter out some parse trees that simply cannot be understood, but it might also reveal that one parse tree is ambiguous in the current world.

Assume the ambiguous example utterance from above:

The given grammar assigns the utterance the following two possible parses:

  1. put the white ball that is in a box on the floor:

     MoveCommand(
         Entity("the", 
             RelativeObject(
                 SimpleObject(null, "white", "ball"),
                 Location("inside",
                     Entity("any",
                         SimpleObject(null, null, form:"box"))))),
         Location("ontop",
             Entity("the",
                 SimpleObject(null, null, "floor"))))
    
  2. put the white ball in a box that is on the floor:

     MoveCommand(
         Entity("the", 
             SimpleObject(null, "white", "ball")),
         Location("inside",
             Entity("any",
                 RelativeObject(
                     SimpleObject(null, null, "box"),
                     Location("ontop",
                         Entity("the",
                             SimpleObject(null, null, "floor")))))))
    

The goal of the interpreter is to convert a parse result into a logical interpretation of the final goal of the command. This can be done in two steps:

Goals

Now, assume that the world consists of two balls (LargeWhiteBall, SmallBlackBall), one table (LargeBlueTable), one brick (LargeGreenBrick), and three boxes (LargeRedBox, LargeYellowBox, SmallBlueBox). The white ball, the table and the yellow box are on the floor, while the red box is on the table, the small box is in the yellow box, the black ball is in the small box, and the robot arm is holding the brick. Like this:

Example world

The interpretation of the example utterance might go like this:

Conjunctive goals

The goal inside(LargeWhiteBall, LargeYellowBox) does not say that the box must remain on the floor, which means that the planner is allowed to move LargeYellowBox to another place if it thinks this is a better thing to do.

If we want to ensure that the box is not moved from the floor, we can state this as a conjunctive goal:

(An alternative to using only one floor object, is to use one per column, i.e., floor-0, floor-1, …, until floor-N. You have to decide which one suits best for you).

Disjunctive goals

If, on the other hand, box LargeRedBox had also been on the floor, then there would be two possible boxes that matched. The resulting interpretation would then become:

Now if we want to ensure that the boxes stay on the floor, this can be done via a disjunctive conjunctive (DNF) formula:

Quantifiers

One of the main problems with natural language interpretation is to handle quantification correctly. The Shrdlite grammar can recognize three different quantifiers: the, any and all, and they should all be interpreted differently.

The minimum requirement of your interpreter (and planner) is that it can handle the singular quantifiers (the and any). Furthermore, it is okay if you use the same interpretation for both the and any.

If you want your interpreter to be more interesting, it (and the planner) should also be able to handle the all quantifier, and distinguish between the and any. The main difference between these two is that the implicitly requires that there is only one object that fits the description. If there are more than one possible interpretation of a the quantified object, then the system can, for example, ask a clarification question.

Clarification questions

If the user’s utterance is ambiguous in the current world, i.e., if there are several parse trees that can be translated into goals, then it is helpful if the system can ask a clarification question.

To be able to interpret the answer to a clarification question, the system needs to know which state it was in before it asked the question. For this purpose, it might be necessary for you to extend the definition of the world state.

Detailed description of semantic interpretation

Parsing can be viewed as a function from strings to syntactic structures. In fact, from each nonterminal X in the grammar we can conceptually define a separate parser parseX that takes a string and returns a list of syntactic objects of type X:

function parseEntity(input : string) : Entity[]
function parseLocation(input : string) : Location[]
...etc...

(Note: this is not how the NEarley parser works, but I’m just saying that a parser could work like this).

In a similar way we can view semantic interpretation as a function from syntactic to semantic structures. For each nonterminal X, the semantic interpretation would then be a function interpretX that takes a syntactic object of type X and returns a semantic object XSem:

function interpretEntity(syntax : Entity) : EntitySem
function interpretLocation(syntax : Location) : LocationSem
...etc...

This means that for each syntactic structure (Command, Entity, Location, Object, …) you have to decide what the corresponding semantic type should look like.

The interpretation context

The interpretation can only be done in the context of the current world state. One solution is to create an interpretation class that has the current world state as an instance variable, and let all functions interpretX be methods in that class:

class ShrdliteInterpretation {
    constructor(private currentState : WorldState)
    private interpretEntity(syntax : Entity) : EntitySem {
        ...code for interpretEntity...
    }
    ...etc...
}

Another solution is to add the world state as an argument to the interpretation functions:

function interpretEntity(syntax : Entity, currentState : WorldState) : EntitySem {
    ...code for interpretEntity...
}

So, given a world state (with two small balls a,b, two large balls c,d and two boxes e,f), what is an adequate semantic interpretation of an object description (such as “small ball”)? Should it be the name of a ball? Or a set of balls? Or a set of set of balls? Or…?

Goals

Given the same world state, what should the semantic interpretation of a command (such as “put all small balls in a box”) be? The current signature of the function interpretCommand (in file Interpreter.ts) is like this:

function interpretCommand(cmd : Command, state : WorldState) : string | DNFFormula

If there’s an interpretation error, the function should return a string describing the error, but if it finds an interpretation, then it is returned as a formula in disjunctive normal form. It consists of a list of conjunctions, which in turn consists of a list of literals:

export class DNFFormula {
    constructor(public conjuncts:Conjunction[])
}
export class Conjunction {
    constructor(public literals:Literal[])
}
export class Literal {
    constructor(public relation:string, public args: string[], public polarity:boolean)
}

So, the suggested interpretation is a list of list of “literals”, where a literal is an atomic formula or its negation. Here are some examples of literals, together with how they are written as Literals:

ontop(a,b) Literal("ontop", ["a","b"], true)
holding(q) Literal("holding", ["q"], true)
¬leftof(c,d) Literal("leftof", ["c","d"], false)
p Literal("p", [], true)

So, why a list of list of literals? The idea is that it is a logical formula in disjunctive normal form. The following are translations of some logical formulae in disjunctive normal form:

PQ DNFFormula([Conjunction([P, Q])])
QR DNFFormula[Conjunction([Q]), Conjunction([R])])
(PQ) ∨ R DNFFormula([Conjunction([P,Q]), Conjunction([R])])
P ∧ (QR) …must be converted to DNF first…

Using DNFFormula as the semantic result of a command allows us to distinguish between the following commands:

“put a small ball beside a box” beside(a,e)beside(a,f)beside(b,e)beside(b,f) DNFFormula([Conjunction([beside(a,e)]), Conjunction([beside(a,f)]), Conjunction([beside(b,e)]), Conjunction([beside(b,f)])])
“put all small balls beside all boxes” beside(a,e)beside(a,f)beside(b,e)beside(b,f) DNFFormula([Conjunction([beside(a,e), beside(a,f), beside(b,e), beside(b,f)])])

Exercise suitable for discussion within the project groups:

Ambiguous interpretations

Sometimes an utterance is syntactically ambiguous, and each syntactic structure can have a distinct a semantic interpretation in the current world. An example is the second example in the small world:

This utterance has two possible syntactic analyses, corresponding to the following groupings of “a box”:

Both of these analyses have a distinct interpretation in the small world: Either the user wants the robot to put the black ball (which happens to be in a box) on the floor, or the user means that the black ball should end up in the yellow box (which is the only box that is on the floor).

This utterance is therefore ambiguous, and you have to handle that in some way. One variant is to just report that it is ambiguous and refuse to do anything. But there are of course more elaborate ways of handling ambiguities, such as by asking the user a clarification question. In that case you have to make more modifications to the code.

Interpretation failures

Sometimes a specific syntactic structure does not have a semantic interpretation in the current world. An example is the first example in the small world:

This utterance has two possible syntactic analyses (just as for the black ball). However, the first analysis cannot be interpreted in the small world, since there are no white balls in any boxes. How can we accomplish this?

One suggestion is to let the interpretX(...) functions throw an exception if there is no interpretation. The calling function interpretCommand(...) can then handle this by interpreting all possible parses, and only report an interpretation error if there are no interpretations whatsoever.

Another alternative is to return a string if there is an interpretation error, and the calling function can test if the result is a string or a “real” result. This is the approach we have chosen for the function interpretCommand(...), but you can choose any other approach for the other interpretX(...) functions:

function interpretCommand(cmd:Command, state:WorldState) : string | DNFFormula

Montague semantics

The semantic interpreter in your Shrdlite project will be a very light-weight variant of Montague semantics. Read more about these things in chapter 2 of Ted Briscoe’s introduction to syntax and semantics:

A final note: Perhaps you noticed that the interpreter doesn’t return a list of XSem objects (which the parser does). The reason for this is that you won’t need that in your Shrdlite project (at least not in the basic project), but Briscoe gives in section 2.3 some examples where the same syntactic structure can give rise to several semantic interpretations.

Planning

The planner should take a representation of the goal, and a representation of the current world.

Let’s continue with our example utterance and world as before, giving the same goal:

Now, the planner should take the world representation and the goal and return a sequence of the basic actions (rlpd). Here is a possible solution (assuming that the robot arm starts in position 0):

Or, in other words:

Note that to be able to complete this, the planner also needs a representation of the background knowledge (e.g., which kinds of objects can be on top of, or inside, which objects).

The plan and the GUI

The Javascript GUI wants the final plan as a list of strings, where each string is either a basic action (e.g., r or d), or a system utterance that will be printed as part of the dialogue. Like this:

["First I drop the brick",
 "r", "d",
 "Then I move the black ball",
 "r", "r", "p", "r", "d",
 "Then I move the small box",
 "l", "p", "l", "d",
 "Finally I put the white ball in the yellow box",
 "l", "l", "p", "r", "r", "r", "d"]

Grading

To be acceptable, your program must fulfill at least the following:

Extensions

Your project can receive 0–6 bonus points counting towards a higher grade. For this you must extend the program in some ways, such as:

In normal cases, provided that the code is effective, functional and well-commented, an extension will give 1–3 bonus points, depending on how complicated it is.