Research Papers Library

Automatic generation of sets of keywords for theme characterization and detection

The paper describes a system that automatically detects themes in a textual corpus and characterizes each of them by a set of keywords, that is, words whose co-occurrence in a paragraph indicates that the paragraph addresses a given theme. Pichon and Sébillot (2000) present a first version of the system, in which those sets are obtained with the CHAVL hierarchical clustering algorithm by grouping words that have a similar distribution over paragraphs. The weaknesses of that version (class quality highly dependent on manual parameter settings, and relevant classes in the classification tree that are hard to identify automatically) are largely reduced here by using a combined classification of the paragraphs based on their lexical cohesion. This new classification first makes it possible to densify the processed data, which helps CHAVL produce more satisfactory classes; it also provides a means to establish an original statistical quality measure that can be exploited both to identify the relevant classes in the tree and to reorganize some of the merges proposed by CHAVL.
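The core idea of grouping words by their distribution over paragraphs can be illustrated with a small sketch. It is not CHAVL: as a stand-in, it uses plain average-linkage agglomerative clustering with Jaccard similarity between the sets of paragraphs in which words occur. The function names, the whitespace tokenization, and the `min_similarity` threshold are all assumptions made for illustration.

```python
from itertools import combinations


def paragraph_sets(paragraphs):
    """Map each word to the set of paragraph indices in which it occurs."""
    occurrences = {}
    for index, text in enumerate(paragraphs):
        # Naive whitespace tokenization (an assumption, not the paper's method).
        for word in set(text.lower().split()):
            occurrences.setdefault(word, set()).add(index)
    return occurrences


def jaccard(a, b):
    """Similarity of two non-empty sets of paragraph indices."""
    return len(a & b) / len(a | b)


def cluster_words(paragraphs, min_similarity=0.5):
    """Merge the two most similar word classes until none is similar enough.

    Stand-in for CHAVL: average-linkage clustering over Jaccard similarity.
    """
    occurrences = paragraph_sets(paragraphs)
    clusters = [frozenset([word]) for word in sorted(occurrences)]

    def similarity(c1, c2):
        # Average pairwise similarity between the words of two classes.
        pairs = [(x, y) for x in c1 for y in c2]
        total = sum(jaccard(occurrences[x], occurrences[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda p: similarity(*p))
        if similarity(*best) < min_similarity:
            break
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return clusters


if __name__ == "__main__":
    corpus = [
        "goal striker match",
        "striker goal referee",
        "vote election senate",
        "election vote ballot",
    ]
    for word_class in cluster_words(corpus, min_similarity=0.9):
        print(sorted(word_class))
```

The threshold plays the role of the manual parameter setting that the abstract identifies as a weakness: the classes obtained depend directly on where the merging is stopped.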
