As the amount of online textual information (e.g., web pages, blogs, tweets, email, news articles, office documents, and scientific literature) grows explosively, it is increasingly important to develop tools that help us manage and exploit it. Web search engines, such as Google and Bing, are good examples of such tools, and they are now an essential part of everyday life. In this course, you will learn the underlying technologies of these and other powerful tools for managing and analyzing text information. You will learn the basic principles and algorithms for managing, analyzing, and mining text data, and gain hands-on experience using existing information retrieval toolkits to set up your own search engines and improve their search accuracy. You will also have an opportunity to work on a course project on a topic of your choice related to the course materials.
Unlike structured data, which is typically managed with a relational database, textual information is unstructured and poses special challenges because it is difficult to understand precisely both natural language and users' information needs. In this course, we will introduce a variety of techniques for accessing and mining text information. The course emphasizes basic principles and practically useful algorithms. Topics to be covered include, among others, text analysis, text retrieval, text categorization, text filtering, clustering, text data mining, search engine design and implementation, and applications in Web search and mining.
The course is lecture-based. Grading is based on both individual and group assignments, a late midterm examination, and a course project. Students who registered for the course for 4 credit hours are also required to complete a literature survey on a project-related topic. For more information about the course policy, please see "Basic Information" of the course.