CS598CXZ Advanced Topics in Information Retrieval (Fall 2013)

Instructor: ChengXiang Zhai


Note: In general, the lecture slides are the best "definition" of the core content -- the content to be tested on the midterm. That is, you are expected to understand all the major points, models, and techniques that we have discussed in class; anything beyond the slides can be regarded as optional. Thus, all the lecture slides are required readings.

Required Readings

  1. IR History

    This is a concise and complete review of the history of IR research up to 2010. It gives an excellent historical view of some of the most important ideas and techniques and their impact on applications. Read the whole article without worrying about understanding all the details.

  2. A. Singhal, Modern Information Retrieval: A Brief Overview, In IEEE Data Engineering Bulletin 24(4), pages 35-43, 2001.

    This is an excellent overview paper of IR (up to 2001, obviously), slightly biased toward empirically effective techniques. Your goal in reading it is to learn the general history of IR and get a summary of IR techniques from an empirical perspective. Read the whole paper.

  3. Rosenfeld's notes (estimation and information theory)

    The goal of reading these notes is to learn some basic concepts in probability, statistics, and information theory. You should read at least Section 3 of the estimation note and all of the information theory note except for Section 1.1.6. You should fully understand the derivation of the maximum likelihood estimate for the binomial distribution, and most of the contents of the information theory note. If you can't understand these, you may want to read the relevant discussions in a textbook on probability and statistics and in a book on information theory. Any book on these topics should be sufficient.

  4. V. Bush, As we may think, 1945.

    This is truly a classic paper. Read it to appreciate Bush's great vision from more than six decades ago, which has still NOT been completely realized today. At a minimum, read everything starting from Section 6. This is a required reading for completing assignment #1.

  5. [Salton & Lesk 68]

    This is a classic paper about early SMART experiments. Your goal in reading this paper should be to understand how the authors designed their experiments to test many different hypotheses, and how they used statistical significance tests to check all the hypotheses. You should also know the major conclusions drawn in this paper. Read the whole paper.

  6. Sanderson's review of Test Collection Evaluation

    This is a nice review of the test collection evaluation method (i.e., the Cranfield method) and of research on this evaluation methodology. Your goal in reading this paper is to understand precisely the major IR measures (including precision, recall, fallout, average precision, precision at k documents, MRR, F measure, nDCG, bPREF, etc.) and when to use each. Another goal is to get an overview of the research done in IR evaluation. Read the entire paper, but the most important content is in Chapters 2-4.

  7. [Singhal et al. 96]

    This is a good example of an empirical exploration of retrieval models. Read the entire paper to understand how the pivoted length normalization formula was derived through experimental study.

  8. [Robertson and Zaragoza 2009]

    This is a nice introduction to how one of the most effective retrieval functions, BM25, was developed and later extended. Read the entire survey.

  9. SLMIR

    This book has a chapter (Chapter 2) on a general survey of major retrieval models and an extensive coverage of statistical language models for IR, which might be useful if you want to have a good picture of all kinds of retrieval models in general. Chapters 3-5 are most useful for understanding language models for retrieval. Chapter 7 is useful for understanding topic models. Read other chapters if interested.

  10. [Fang 07]

    This thesis is a systematic study of a new way of developing retrieval models based on an axiomatic framework, with promising research results. Your goal in reading it should be to understand the basic idea of this axiomatic approach and to know how formal constraints may help evaluate a retrieval function without doing experiments. Read Chapter 3 and Chapter 4.

  11. Note on KL-div Retrieval Model

    Read the entire note.

  12. Note on EM

    Read the entire note to understand the EM algorithm rigorously.

  13. Introduction to Learning to Rank

    Read whatever you can to understand the basic idea of learning to rank.

  14. [Zobel & Moffat 2006]

    This is an excellent tutorial on the implementation of a search engine. Read whatever you can to understand how an inverted index is constructed and how it can be used to score documents quickly for a query. Make sure you know the basic idea of variable-length encoding and how it can be used to compress integers.

  15. MapReduce for Text Processing

    Read chapters 2-4 to understand how MapReduce works and how it can be used for parallel indexing.
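
As a rough illustration of the parallel indexing idea mentioned in reading 15, the sketch below simulates the map, shuffle, and reduce steps in a single Python process. It is not taken from the book and does not use any real MapReduce framework; the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_index`) and the toy documents are hypothetical, and tokenization is assumed to be simple whitespace splitting.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Mapper: emit (term, (doc_id, term_frequency)) pairs for one document."""
    counts = defaultdict(int)
    for term in text.lower().split():  # assumption: whitespace tokenization
        counts[term] += 1
    return [(term, (doc_id, tf)) for term, tf in counts.items()]


def shuffle(pairs):
    """Group emitted values by term, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for term, posting in pairs:
        groups[term].append(posting)
    return groups


def reduce_phase(term, postings):
    """Reducer: sort one term's postings by document ID to form its postings list."""
    return term, sorted(postings)


def build_index(docs):
    """Run the three steps sequentially; a real system would run mappers and reducers in parallel."""
    pairs = []
    for doc_id, text in docs.items():
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(term, postings) for term, postings in shuffle(pairs).items())


# Hypothetical toy collection.
docs = {
    1: "information retrieval models",
    2: "retrieval evaluation retrieval",
    3: "language models for retrieval",
}
index = build_index(docs)
```

Each entry of `index` maps a term to a list of `(doc_id, term_frequency)` pairs sorted by document ID, which is the kind of postings list that reading 14 describes using for fast query scoring.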

Optional Readings

  • [Fuhr 92]

    This is an excellent survey of probabilistic retrieval models with rigorous treatment of all the major ideas up to the early 1990s, including early ideas on learning to rank.

  • [Zhai & Lafferty 06]

    This paper gives a general decision-theoretic framework for modeling information retrieval that can cover many existing retrieval models. Read the entire paper except for section 5.3.

    More readings may be added later