CS 510: Advanced Information Retrieval (Fall 2017)

Instructor: ChengXiang Zhai

Overview

This is a graduate-level course covering advanced topics in the growing field of information retrieval (IR), where the goal is to study how to build intelligent software tools that help users manage and make use of large amounts of unstructured (typically textual) data. The impact of IR research is most visible in the recent dramatic growth of the Web search engine industry, but applications of IR research also go beyond search engines to text data mining and intelligent information systems in general. In this course, we will view IR as broadly related to all kinds of applications involving text data.

Text data include all kinds of natural language text, such as web pages, news articles, scientific literature, emails, enterprise documents, and social media posts. In contrast to non-textual data, which are usually generated by physical devices, text data are generated by humans and meant to be consumed by humans. Due to the rapid growth of text data, we can no longer digest all the relevant information in a timely manner. Thus there is a pressing need to develop intelligent software tools that help people manage and make use of vast amounts of text data (“big text data”) for various tasks, especially those involving complex decision-making.

Logically, to harness big text data, we first need to identify the text data relevant to a particular application problem (i.e., perform text data retrieval) and then analyze the identified text data in more depth to extract any knowledge needed for a task (i.e., perform text data analysis). Because natural language understanding is difficult for computers, the approaches that work well for text retrieval and text analysis tend to be statistical, especially those based on statistical language models, which provide a general and robust representation of text data and enable probabilistic and statistical inferences about their content. These approaches can be applied to text data in any natural language and about any topic. As such, language model-based approaches are the focus of this course.
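As a rough illustration of what a statistical language model looks like in the retrieval setting, the sketch below estimates a unigram model per document and scores a query by its smoothed log-likelihood. It is a minimal sketch under stated assumptions, not the course's or MeTA's implementation: the function names, the whitespace tokenization, the toy documents, and the choice of Jelinek-Mercer interpolation with `lam=0.5` are all assumptions made for illustration.

```python
import math
from collections import Counter


def unigram_model(text):
    """Maximum-likelihood unigram model: p(w) = count(w) / total words.

    Tokenization by lowercasing and splitting on whitespace is an
    assumption made to keep the sketch short.
    """
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}


def query_log_likelihood(query, doc_model, background_model, lam=0.5):
    """Log-probability of the query under a smoothed document model.

    Uses Jelinek-Mercer interpolation (an assumed smoothing choice):
        p(w | d) = (1 - lam) * p_ml(w | d) + lam * p(w | collection)
    """
    score = 0.0
    for w in query.lower().split():
        p = (1 - lam) * doc_model.get(w, 0.0) + lam * background_model.get(w, 0.0)
        if p == 0.0:
            # The word occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


# Hypothetical toy collection, for illustration only.
docs = [
    "text retrieval finds relevant text",
    "cats chase mice at night",
]
background = unigram_model(" ".join(docs))
doc_models = [unigram_model(d) for d in docs]
scores = [query_log_likelihood("text retrieval", m, background) for m in doc_models]

# Rank document indices from most to least likely to generate the query.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The interpolation with a collection-wide model keeps a document from receiving zero probability just because it lacks one query word, which is one reason such models are robust across languages and topics.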

Specifically, this course will provide a systematic introduction to the statistical language models that have been applied to text data retrieval and analysis, with an emphasis on thorough explanation of the most useful basic models and their applications to ensure that students have a solid understanding of them. Programming assignments based on a modern text retrieval and analysis toolkit (i.e., MeTA) will give students hands-on experience in implementing and experimenting with these basic models, so that they will have sufficient knowledge and skills to apply them immediately to many real-world application problems. To provide students with a broader picture of the state-of-the-art approaches, we will also briefly review representative advanced models and discuss current trends in research. Students also have the opportunity to work on a course project on a topic of their choice to extend their knowledge and skills in various ways, including implementing an advanced algorithm, developing a novel application system, or conducting original research on a frontier research topic.

Format

The course will have weekly lectures mostly given by the instructor. There will be frequent short written assignments to help students master the core content covered in the lectures. There will be two in-class midterm exams (one given in the middle of the semester and one given close to the end) to examine the students’ mastery of the core content. The examination questions will be a subset of the questions that the students have worked on in the assignments, so the purpose of the exams is mostly to verify that the students have indeed mastered the course materials after working on all the assignments. A few programming assignments will be given to help students learn practical skills in implementing and experimenting with a few of the most useful algorithms using the MeTA toolkit. Students also have the opportunity to finish a course project on a topic of their choice. Recognizing the diverse needs and interests of the students, we provide the option of working on either a software development project that aims to extend the MeTA toolkit or a research project that aims to generate a publication. A software development project can extend MeTA by adding either an implementation of an algorithm not already included in the current version of MeTA or an innovative, useful system built using functions provided by MeTA. Group projects are allowed and encouraged.

Prerequisite

Students are expected to have a good knowledge of basic probability and statistics in addition to programming skills at the level of CS 225 or a similar programming course. Some background in one or more of the following areas would be a plus but is not required: information retrieval, machine learning, natural language processing, data mining, or databases. If you are not sure whether you have the right background, please contact the instructor.

Administrative

Readings

The required readings for this course are a combination of book chapters, survey articles, and research papers. Most of the readings should be available online (if not, hard copies will be made for you). Specific reading assignments will be posted on the schedule page.

Course Policy and Grading

  1. Attendance

    Attendance is mandatory, but use common sense if you are sick or run into an emergency. If you cannot attend a class, you must send (or ask someone to send) an explanation to the instructor no later than 24 hours after the class. For example, if you cannot attend a class on Tuesday, you need to send a message before 12:15pm the next day (i.e., Wednesday).

  2. Assignments

    There are two types of assignments. One is short, frequent written assignments, which are designed to ensure that every student has a deep and precise understanding of the major core topics; the other is programming assignments, which are designed to give students an opportunity to implement and experiment with an algorithm. Students are required to complete all assignments independently. Discussion with others is allowed to the extent of helping understand the material. The purpose of student collaboration is to facilitate learning, not to circumvent it. The actual solution must be done by each student alone, and the student should be ready to reproduce their solution upon request.

    You must exercise academic integrity. Make sure that you have read the University Policy on Academic Integrity, especially the section on plagiarism.

    Late submission of an assignment will result in a reduced grade for the assignment, unless an extension has been granted by the instructor. An assignment is worth full credit at the beginning of class on the due date (later if an extension has been granted). It is worth at most 90% credit for the next 24 hours. It is worth 75% credit for the following 24 hours. It is worth 50% credit after that. If you need an extension, please ask for it (by emailing the instructor) as soon as you know you need it. Extensions that are requested promptly will be granted more liberally. You must turn in all assignments.

    By default, students should submit all assignments via Compass as PDF files unless otherwise specified. The scores of all assignments will be released via the Chara Gradebook.
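The late-submission schedule above can be sketched as a small function. This is only an illustration of the arithmetic: the function name `late_credit_cap`, measuring lateness in hours from the (possibly extended) deadline, and the treatment of the exact 24-hour and 48-hour boundaries are assumptions, since the policy does not spell them out.

```python
def late_credit_cap(hours_late: float) -> float:
    """Return the maximum fraction of credit for a submission.

    hours_late is measured from the beginning of class on the due date
    (or from the extended deadline, if an extension was granted).
    Boundary handling (<= 24, <= 48) is an assumption of this sketch.
    """
    if hours_late <= 0:
        return 1.00  # on time: full credit
    if hours_late <= 24:
        return 0.90  # first 24 hours after the deadline
    if hours_late <= 48:
        return 0.75  # following 24 hours
    return 0.50      # any time after that


# Example: a submission 30 hours late that would otherwise earn 40 points.
capped_points = 40 * late_credit_cap(30)
```

For instance, `late_credit_cap(30)` falls in the second window, so a 40-point submission would be worth at most 30 points.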

  3. Midterm examinations

    There will be two midterm exams, one given in the middle of the first half of the semester and the other in the middle of the second half. Their purpose is to ensure that students have a good understanding of all the core topics covered in the course. The questions in both exams will be very similar to the questions in the assignments. Both exams will be given in the classroom during our regular class meeting time and will last 75 minutes.

  4. The course project

    Students are required to finish a course project on a topic of their choice. Group projects are allowed and encouraged.

    Depending on their interests and expertise, students can choose to work on either a software development project that aims to extend the MeTA toolkit or a research project that aims to generate a publication. A software development project can extend MeTA by adding either an implementation of an algorithm not already included in the current version of MeTA or an innovative, useful system built using functions provided by MeTA. See the project page for more details.

  5. Grading

    Grading will be based on the following weighting scheme:

    • Assignments: 30%
    • Midterm exam 1: 20%
    • Midterm exam 2: 20%
    • Project: 30%
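As a worked example of the weighting scheme, the sketch below combines component scores into a final course score. The weights come from the list above; the function name `final_score`, the dictionary keys, the 0-100 scale for each component, and the sample scores are assumptions for illustration. How a final score maps to a letter grade is not stated here, so the sketch does not attempt it.

```python
# Weights in percent, taken from the grading scheme above.
WEIGHTS = {
    "assignments": 30,
    "midterm1": 20,
    "midterm2": 20,
    "project": 30,
}


def final_score(scores):
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical student, for illustration only.
example = final_score(
    {"assignments": 90, "midterm1": 80, "midterm2": 70, "project": 100}
)
```

With these sample scores the weighted sum is 27 + 16 + 14 + 30 = 87.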