Saturday, March 27, 2010
Mining a Year of Speech: a Digging into Data challenge
"Technologies for storing and processing vast amounts of text are mature and well-defined. In contrast, technologies for browsing or mining content from large collections of non-textual material, especially audio and video, are less well-developed. Large scale data mining on text has helped transform the relevant disciplines; the disciplines dealing with spoken language may well reap similar benefits from accessible, searchable, large corpora. This project shall address the challenge of providing rich, intelligent data mining capabilities for a substantial collection of spoken audio data in American and British English. We shall apply and extend state-of-the art techniques to offer sophisticated, rapid and flexible access to a richly annotated corpus of a year of speech (about 9000 hours, 100 million words, or 2 Terabytes of speech), derived from the Linguistic Data Consortium, the British National Corpus, and other existing resources. This is at least ten times more data than has previously been used by researchers in fields such as phonetics, linguistics, or psychology, and more than 100 times common practice"