Personal information

David Alan Howard ELWORTHY

Key experience

Software skills

Major programming languages: C, C++, Java. More than 20 years total experience.
Some experience or knowledge of: assembly languages, Perl, Python, Basic, JavaScript, Haskell, Prolog, Lisp, XML/XSLT, HTML, basic web programming.


December 2004 onwards: Google Inc., Santa Monica, CA.

Senior software engineer and tech lead/manager for enterprise search quality.
Tech lead/manager for video search.
Tech lead/manager for a personal search project.

February 2003 - October 2004: Unveil Technologies Inc., Waltham, MA.

Member of core technology group, responsible for natural language processing and machine learning.

January 2001 - January 2003: LingoMotors Inc., Cambridge, MA.

Principal Search Architect
Research, design and development of technologies for search and text categorization, including:

August 1999 - December 2000: Microsoft Research Ltd, Cambridge, UK.

Research Software Design Engineer, Information Retrieval and Analysis Group.
Main area of work: question-answering, based on combining probabilistic information retrieval, large-scale natural language processing and limited inference.
Additional areas of work included evaluation of information retrieval systems and examination of the impact of text processing strategies such as stemming.

November 1995 - July 1999: Canon Research Centre Europe Ltd, Guildford, UK.

Researcher and Project Leader, Natural Language Group.

Information Access (IA) project.

The IA project had the goal of applying natural language processing techniques to information access for Canon products and services, addressing the entire process from content creation, through the retrieval process, to user interfaces. The main result of the project was the ANVIL system, which used a combination of natural language and statistical techniques to retrieve textually annotated images from a database with very high accuracy. ANVIL included a novel Web-based user interface, designed to make the system easy for novice users. I started the Information Access initiative, and led the IA project and ANVIL application development, as well as carrying out a substantial part of the research and engineering. ANVIL was implemented in C++ on Unix and Windows, and was successfully transferred to a product development group in both English and Japanese versions.

Major technical areas included:
Major management and co-ordination areas included:
Other areas of work at Canon: language identification, based on accurate computation of statistics with confidence intervals; development of a dependency parser; liaison with other Canon research labs; patents, standards and strategy.

June 1993 - October 1995: Sharp Laboratories of Europe Ltd., Oxford, UK.

Senior Researcher, Natural Language Processing Group.

Integrated Language Database (ILD) project.
The aim of this project, carried out in collaboration with academic and industrial partners, was to develop an environment which allows the construction of large lexical databases encoded as typed feature structures, using supporting evidence from corpus analysis tools. Implementation was in C++ on Sun workstations and IBM PCs, using an object-oriented database.

Principal responsibilities: overall system designer of the ILD; project definition and management; implementation of the database, user interface and TFS engine for the ILD.

December 1992 - May 1993: Computer Laboratory, Cambridge University, UK.

Research Assistant, Acquilex-II Project.

Developed a part of speech tagger, based on a bigram Hidden Markov Model, and written in ANSI C. The program was used for experiments on the practicalities of effective tagging. The tagger provides an efficient, language-independent implementation of the major tagging and training algorithms, and has subsequently been used by a number of other research groups.

July 1991 - October 1991: Rank Xerox EuroPARC, Cambridge, UK.

Summer internship, design rationale group. (During my PhD.)

December 1986 - September 1989: Acorn Computers Ltd, Cambridge, UK.

Programmer/Research programmer.

Esprit Project 860: Linguistic Analysis of the European Languages

Project 860 was concerned with the labelling and analysis of text corpora to support speech processing, and with conversion between orthographic and phonetic representations of isolated words, across six European languages. I worked as a research programmer on the project, on:
RISC OS Project

RISC OS was the proprietary operating system of the Acorn Archimedes. Main work: programming and development of parts of the RISC OS applications suite, with several hundred thousand users; co-writing the RISC OS User Guide.

September 1985 - November 1986: Logica Communications and Electronic Systems Ltd, London, UK.


Member of a large project developing an integrated graphics editor, business charting and plotting package for IBM, to run on modified IBM PCs. Main areas of work included programming an object-based graphics editor for use on business charts, and coding the directory display component and system message handler.

September 1983 - July 1984: Spectronics MicroSystem Ltd, Cambridge, UK.

Embedded Systems Engineer.


Ph.D. in natural language semantics, October 1989 - November 1992

Computer Laboratory, Cambridge University.
Dissertation: The Semantics of Noun Phrase Anaphora.

Diploma in Computer Science, October 1984 - August 1985

Computer Laboratory, Cambridge University.
Passed with distinction.
Project: implementation of a functional programming language on special hardware.

BA (Hons) in Engineering, specialising in Electrical Sciences, October 1980 - June 1983

Second class honours, first division
Engineering Department and St. Catharine's College, Cambridge University

Selected Publications

Electronic copies available from Publications page.

The Semantics of Noun Phrase Anaphora
University of Cambridge Computer Laboratory Technical report number 289, 1993. Reprint of PhD thesis.

A Theory of Anaphoric Information
Linguistics and Philosophy, vol.18 no. 3, 1995.
Anaphors are referring expressions such as pronouns, which derive their meaning primarily from the context. Most work in natural language semantics has been concerned with the meaning of sentences taken in isolation; anaphora requires a significantly different approach. The major theories of anaphora are empirically inadequate in certain respects, and employ ill-understood and unconventional logical formalisms. My work attempted to address this methodological problem, as well as covering a wider range of empirical ground than the existing theories.

Automatic Error Detection in Part of Speech Tagging
Proceedings on New Methods in Language Processing, UMIST, 1994.

Does Baum-Welch Re-estimation Help Taggers?

Proceedings of 4th ACL Conference on Applied Natural Language Processing, Stuttgart, 1994.

Tagset Design and Inflected Languages

Proceedings of EACL SIGDAT workshop "From Texts to Tags: Issues in Multilingual Language Analysis", Dublin, 1995.
Three papers based on systematic experiments in part of speech tagging, aimed at building the practical knowledge to make effective taggers. The ANLP paper has been widely cited.

Language Identification with Confidence Limits
Proceedings of the 6th Workshop on Very Large Corpora, Montreal, 1998.
A technique for identifying the language of a text, based on using statistics with confidence limits to avoid making a decision without sufficient evidence.

A Finite-State Parser with Dependency Structure Output
6th International Workshop on Parsing Technology, Trento, 2000.

ANVIL: a system for the retrieval of captioned images using NLP techniques.
Third UK Conference on Image Retrieval (CIR2000). Joint paper with T. Rose, A. Kotcheff, A. Clare and P. Tsonis.

Retrieval from Captioned Image Databases using Natural Language Processing

A natural language system for retrieval of captioned images.
Journal of Natural Language Engineering, June 2001. Joint paper with T. Rose, A. Clare and A. Kotcheff.

Based on the ANVIL system. See the entry under Canon Research Centre above for details.

Patent applications

A data processing method and apparatus for identifying a classification to which data belongs. US 6,125,362. (Automatic language identification)

Apparatus and method for processing natural language. EP 0 992 919 A1. (Phrase matching for IR)

Apparatus and method for generating processor usable data from natural language input data. EP 1 033 663 A2. (Parsing)

Natural language search method and apparatus. EP 1 033 772 A2. (Search context extraction)

Other professional activities

Local organising committee for IRSG 2000, Cambridge.

Program committees: KDD 2000 Workshop on Text Mining; EMNLP-VLC 2000.

Member of ACL, ACM and SIGIR.