This is a graduate-level course covering advanced topics in the growing field of information retrieval (IR), which studies how to build intelligent software tools that help users manage and make use of large amounts of unstructured (typically textual) data. The impact of IR research is most visible in the recent dramatic growth of the Web search engine industry, but its applications also extend beyond search engines to text data mining and to intelligent information systems in general. In this course, we will view IR as broadly related to all kinds of applications involving text data.
Text data include all kinds of natural language text, such as web pages, news articles, scientific literature, emails, enterprise documents, and social media posts. In contrast to non-textual data, which are usually generated by physical devices, text data are generated by humans and meant to be consumed by humans. Because text data are growing so rapidly, we can no longer digest all the relevant information in a timely manner. Thus there is a pressing need to develop intelligent software tools that help people manage and make use of vast amounts of text data (“big text data”) for various tasks, especially those involving complex decision-making.
Logically, to harness big text data, we first need to identify the text data relevant to a particular application problem (i.e., perform text data retrieval) and then analyze the identified text data in more depth to extract whatever knowledge a task requires (i.e., perform text data analysis). Because natural language understanding remains difficult for computers, the approaches that work well for text retrieval and text analysis tend to be statistical, especially those based on statistical language models, which provide a general and robust representation of text data and enable probabilistic and statistical inferences about their content. These approaches can be applied to text data in any natural language and on any topic. As such, language model-based approaches are the focus of this course.
Specifically, this course will provide a systematic introduction to the statistical language models that have been applied to text data retrieval and analysis, with an emphasis on thorough explanation of the most useful basic models and their applications, so that students gain a solid understanding of them. Programming assignments built on a modern text retrieval and analysis toolkit (i.e., MeTA) will give students hands-on experience in implementing and experimenting with these basic models, so that they will have sufficient knowledge and skills to apply them immediately to many real-world application problems. To give students a broader picture of the state-of-the-art approaches, we will also briefly review representative advanced models and discuss current trends in research. Students will also have the opportunity to work on a course project on a topic of their choice to extend their knowledge and skills in various ways, including implementing an advanced algorithm, developing a novel application system, or conducting original research on a frontier research topic.