URL of this page is www.skeptron.uu.se/broady/dl/palate-proposal-march2002.htm

Personalized Access to Large Text Archives (PALaTe)

Extract from
Wolfgang Nejdl et al., Personalized Access to Distributed Learning Repositories (PADLR). Final Proposal, March 25, 2001, pp. 17-19.

Contributing Research Groups and PIs:
Uppsala (Borin/Broady)
CID (Broady).

Working Title. PALaTe: Personalized Access to Large Text Archives

Problem Description. Text is still important in the teaching of almost any subject, viz. in the form of textbooks and other course texts. In Languages and Humanities education, (large) textual resources are also quite often objects of study in themselves. Arguably, their effective deployment as study objects in the context of ICT-based personalized learning demands some kind of language understanding. Hence, personalized access and navigation among such resources should – almost by definition – make use of Computational Linguistics (CL) / Natural Language Processing (NLP) techniques, to complement the more general personalization tools which will be developed in the submodule “PLeaSe: Personalized Learning Sequences”.

In this submodule/testbed, we thus consider the issue of personalized access to large text archives in Languages and Humanities education. In order to make the fruits of our labor in the proposed project useable also in other subject areas, we will focus on certain aspects of this issue, namely how (aspects of the) content and difficulty of texts or parts of texts can be inferred and utilized for creating personalized access to text material.

Research plan and deliverables. We will consider the use of two fairly different kinds of large text archives:

1. In language education and linguistics, large text archives are important mainly (but not only!) because of their (linguistic) form. Here, the so-called text corpus has become an important educational (and research) resource. The uses of text corpora in language education are manifold:

as a data source for the preparation of (monolingual or bilingual) word lists, grammars [1, 2, 3], test items (e.g. for diagnostic tests such as the Didax system being developed in the Swedish Learning Lab (SweLL) APE-DRHum project [5, 6]), etc.
as a source of empirical examples in ‘data-driven learning’ [4]. The English Department at Uppsala uses the British National Corpus in this way, and other language departments are getting ready to do the same, e.g. the Slavic Department for use in their Russian courses.
as a source of reading matter, user-adapted as to its level of difficulty and subject area (where content obviously becomes important, too).

2. On the other hand, in such Humanities subjects as History, Literature Studies, History of Science, Teacher Training, etc., large text archives are important mainly because of their content, i.e. because of the information contained in the texts (and, as a rule, the range of languages dealt with will be much smaller; see below).

Typical issues which arise when such text archives are to be used in education (or research) are:

Locating texts or text portions in the archive which deal with a particular person, place or time (‘PPT extraction’). Partly, this is addressed in the field of Information Extraction (under the heading of “name recognition”), but the problem is still a long way from being solved, especially if we take into account, as our ambition should be, the general problem of entity references in text (by noun, pronoun, hyperonymy / hyponymy, etc.). Concretely, this has been an issue both in the work on the electronic version of Swedish author August Strindberg’s collected works at KTH in Stockholm, and in the work with the 17 so-called Wallenberg Interviews (interviews with Jews who escaped from Hungary and Nazi persecution thanks to Raoul Wallenberg) in the History Department at Uppsala. These and other large textual resources would see more use in education, bridging the gap between education and the kind of research for which this education is preparing the students, if the access to the resources could be made less unwieldy.
Selecting texts or portions of texts in the archive which deal with a particular topic, or succession of topics, the latter for assembling a reading sequence out of a larger textual material.

For both kinds of text archives, and for many of the issues just listed, methods and tools from the fields of Information Retrieval, Information Extraction and CL/NLP are available.

There are also more open-ended research issues in the list, e.g. the problem of entity references in text (already mentioned), or that of determining the level of difficulty of a text (for a language learner having a particular linguistic background; see also submodule “Automatic extraction of metadata and ontological information”, where the related issue of “determining the level of information” is discussed). Generally, we believe that the realistic course of action here is to pursue so-called ‘shallow’, or ‘knowledge-light’ techniques for text corpora used in language education, because of their potential application to a large number of languages. Uppsala University currently offers courses at various levels in about 40 languages, which in practice precludes the use of ‘deep’, ‘knowledge-intensive’ techniques. When such techniques are available (as may be the case for English, German and a few other languages), they should be considered, of course, but developing them from scratch is too costly. For the case of general Humanities textual resources, however, we should consider developing more knowledge-intensive methods for selected problems, such as the ‘PPT extraction’ already mentioned, where there is an expressed need among educators and researchers.
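To make the notion of a ‘shallow’ technique concrete, the following is a minimal sketch of pattern-based marking of persons, places and times. It is an illustration only and not the method proposed here: the function name mark_ppt, the three patterns and the list of prepositions are all assumptions made for the example, the patterns cover unaccented English-style capitalization only, and references by pronoun or hyperonym are entirely out of its reach.

```python
import re

# Hypothetical surface patterns, chosen only for illustration.
# They handle unaccented Latin letters; real archives need far richer resources.
CAPITALIZED = r"[A-Z][a-z]+"

PATTERNS = {
    # Two adjacent capitalized words are guessed to be a person name.
    "PERSON": re.compile(rf"\b{CAPITALIZED}\s+{CAPITALIZED}\b"),
    # A capitalized word after a locative preposition is guessed to be a place.
    "PLACE": re.compile(rf"\b(?:in|from|at)\s+({CAPITALIZED})\b"),
    # A four-digit number between 1000 and 2099 is guessed to be a year.
    "TIME": re.compile(r"\b(?:1\d{3}|20\d{2})\b"),
}


def mark_ppt(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        # Use the capturing group when the pattern has one, else the whole match.
        group = 1 if pattern.groups else 0
        for match in pattern.finditer(text):
            spans.append(
                (match.start(group), match.end(group), label, match.group(group))
            )
    return sorted(spans)


if __name__ == "__main__":
    sample = "Raoul Wallenberg helped many people escape from Hungary in 1944."
    for start, end, label, matched in mark_ppt(sample):
        print(f"{label:6} {start:3}-{end:3} {matched}")
```

Heuristics of this kind are cheap to port across languages, which is their attraction for language education, but they will over-generate (for instance on capitalized sentence-initial word pairs) and miss most indirect references. That gap is what motivates the more knowledge-intensive methods considered for Humanities resources.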

The work with large text archives will proceed along two interconnected lines of research:

1. We will explore the issue of using partial parsing and information extraction techniques for marking text portions for persons, places, and times, and carry out formative evaluation of these techniques in an educational setting. This work will be pursued in collaboration with the work in the submodules “Automatic extraction of metadata and ontological information” and “PLeaSe: Personalized learning sequences”.

Deliverables: Prototype person/place/time partial parser (‘PPT extractor’), and evaluation reports.

2. We will pursue the issue of how to (operationally) define and determine the level of difficulty (or “level of information”; see above) of a text or a portion of a text (for language education purposes it would be useful to be able to determine this even for small linguistic units such as phrases or clauses), and carry out formative evaluation of this definition in an educational setting. This work, too, will be a collaboration with the work in the submodules “Automatic extraction of metadata and ontological information” and “PLeaSe: Personalized learning sequences”.

Deliverables: Preliminary operational definition of level of difficulty (for a particular foreign/second language learner), prototype application for determining level of difficulty at least for Swedish and English text material, and evaluation reports.
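As a rough illustration of what an operational, knowledge-light difficulty measure could look like, the sketch below combines two surface indicators. The first is the LIX readability index (average sentence length plus the percentage of words longer than six letters), a measure swapped in here for illustration and not the definition this submodule sets out to develop. The second, the share of words outside a learner's known vocabulary, is a hypothetical stand-in for personalization; the function names and the vocabulary set are assumptions made for the example.

```python
import re

# Runs of letters (including accented ones such as å, ä, ö); digits are ignored.
WORD_RE = re.compile(r"[^\W\d_]+")
# One or more sentence-final punctuation marks count as one sentence boundary.
SENTENCE_END_RE = re.compile(r"[.!?]+")


def lix(text):
    """LIX = words per sentence + 100 * (words longer than 6 letters / words)."""
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    # Treat text without final punctuation as a single sentence.
    sentences = max(1, len(SENTENCE_END_RE.findall(text)))
    long_words = sum(1 for word in words if len(word) > 6)
    return len(words) / sentences + 100.0 * long_words / len(words)


def unknown_share(text, known_words):
    """Fraction of word tokens not found in the learner's (lowercased) vocabulary."""
    words = [word.lower() for word in WORD_RE.findall(text)]
    if not words:
        return 0.0
    return sum(1 for word in words if word not in known_words) / len(words)


if __name__ == "__main__":
    easy = "The cat sat."
    hard = "The extraordinary philosopher deliberated."
    learner_vocabulary = {"the", "cat", "sat"}  # hypothetical learner profile
    for text in (easy, hard):
        print(text, round(lix(text), 1), round(unknown_share(text, learner_vocabulary), 2))
```

Because both indicators work on any span of text, they can in principle be applied to single clauses as well as whole documents, although scores for very short units will be unstable. How well such surface measures track difficulty for a given learner is an empirical question of the kind the formative evaluation would have to answer.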

Dissemination, Testbeds and Evaluation. Dissemination of results will be done through reports and scientific publications on the different aspects outlined in the research plan. In general, we plan to do research/development and evaluation in parallel (i.e., formative evaluation), but for obvious reasons, the first year will be devoted mainly to research and development, while the second year will be dominated by deployment and evaluation in regular education. We will use existing courses in the departments of the Faculty of Languages, in the History Department and in the Department of Teacher Education as resources for our requirements analysis and as testbeds for our implementations.

Collaboration and Scholarly Exchange. Strong interactions with the submodules “PLeaSe: Personalized Learning Sequences”, “Automatic extraction of metadata and ontological information” and “Content Archives”.

Budget Overview (including overhead costs): Uppsala: 25K first year, 25K second year. Budget will pay for one part-time Postdoc, and for faculty involvement in testbed integration in regular Languages/Humanities curricula, overhead costs, travel and exchange. CID: 10K first year, 10K second year. Budget will pay for a part-time Ph.D. student, overhead costs, travel and exchange.

This HTML version created by Donald Broady. Last updated March 2001