ir@Georgetown::Projects

Information Retrieval Lab Projects

Detecting Relationships among Categories
Opinion Mining and Sentiment Analysis of User Reviews
Query Session Analysis
Spam Detection in Short Text (SMS)
Contextual Search
Personalized Ranking Twitter Friends: Who to follow or not to follow
Domain Specific Search & Mining

Detecting Relationships among Categories
Knowledge of relationships among categories is of interest in domains such as text classification, content analysis, and text mining. We propose and evaluate approaches to identify relationships among document categories. Our method capitalizes on the misclassification results of a text classifier to identify potential relationships among categories, which together form a relationship network. We demonstrate that our system detects such relationships, including some that assessors failed to identify in manual evaluation. We also compared our methods with the state-of-the-art method and demonstrated a significant improvement in precision and recall. In addition, we are interested in discovering relationships in existing hierarchical knowledge representations. The hierarchical nature of existing Web directories, ontologies, and folksonomies is known to provide meaningful information that guides users and applications. We hypothesized that such hierarchical structures provide richer information if they are enriched with links beyond those to parents and siblings, namely links between non-sibling nodes. We call such a structure a networked hierarchy. Our empirical results indicate that a networked hierarchy introduces links between non-sibling nodes that are not evident in a purely hierarchical structure.
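The idea of reading relationships off a classifier's mistakes can be illustrated with a minimal sketch. This is not the published method: the function name relationship_edges, the rate normalization, and the threshold value are assumptions made for illustration only. It links category a to category b when documents of a are misclassified as b at or above a chosen rate.

```python
from collections import Counter


def relationship_edges(true_labels, predicted_labels, threshold=0.2):
    """Return {(a, b): rate} for every pair where documents of category a
    are misclassified as category b at a rate of at least `threshold`.

    Hypothetical helper: the normalization and threshold are illustrative
    assumptions, not the scoring used in the cited papers.
    """
    totals = Counter(true_labels)
    # Count only the errors: (true category, wrongly predicted category).
    confusions = Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )
    edges = {}
    for (t, p), count in confusions.items():
        rate = count / totals[t]
        if rate >= threshold:
            edges[(t, p)] = rate
    return edges


# Toy labels (invented for this example).
true_labels = ["sports", "sports", "sports", "sports",
               "health", "health", "politics", "politics"]
predicted = ["sports", "health", "health", "sports",
             "health", "sports", "politics", "politics"]

edges = relationship_edges(true_labels, predicted, threshold=0.3)
for (source, target), rate in sorted(edges.items()):
    print(f"{source} -> {target}: {rate:.2f}")
```

The resulting edge set can be read as a directed relationship network over the categories; a symmetric variant would merge (a, b) and (b, a).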

N. Goharian, S. Mengle, Networked Hierarchies for Web Directories, 20th International World Wide Web conference (WWW'11), March 2011.

S. Mengle and N. Goharian, Detecting Relationships among Categories using Text Classification, Journal of the American Society for Information Science and Technology (JASIST), 61 (5), May 2010.

S. Mengle, N. Goharian, A. Platt, Discovering Relationships among Categories using Misclassification Information, ACM 23rd Symposium on Applied Computing (SAC), March 2008.

Opinion Mining and Sentiment Analysis of User Reviews
As the popularity of online user reviews continues to increase, it is becoming increasingly difficult for potential customers, and even business owners, to understand which aspects of a business reviewers care about and how reviewers feel about those aspects. Many websites allow and even encourage people to submit reviews of various products and services. The text within these reviews often contains valuable information not found in a single 1-5 "star rating". This research proposed and evaluated a novel approach to efficiently model and analyze the text within user reviews to estimate how much reviewers care about different aspects of a product (e.g., the amenities, food, location, and rooms of a hotel) by estimating the aspects' weights. A vector of aspect weights synthesizes the average customer's preferences and expectations as well as the product's actual performance, thus providing a way to characterize the subject of the reviews. This approach performs statistically similarly to, and arguably better than, the best existing method, but with significantly lower computational complexity (linear time). While the current domain of this research is a hotel review data set, the method is not domain-specific and should work for other types of reviews. This work is in collaboration with the Chief Scientist at Orbitz Worldwide.

A. Yates, N. Goharian, W. Yee, "Semi-supervised Sentiment Analysis: Merging Labeled Sentences with Unlabeled Reviews to Identify Sentiment", American Society for Information Science and Technology (ASIST), Nov 2013.

J. Parker, A. Yates, N. Goharian, W.-G. Yee, "Efficient Estimation of Aspect Weights", In proceedings of ACM 35th Conference on Research and Development in Information Retrieval (SIGIR'12), August 2012.

Query Session Analysis
We developed and evaluated an approach that builds on our earlier research on identifying relationships among topics to understand the topic and intent of user queries, given a sequence of user queries from one or more sessions. The context of the session queries is utilized to improve the effectiveness of identifying the intent or topic of the current query. Earlier efforts utilized a fixed number of preceding queries to derive such contextual information. We proposed and evaluated an approach (DQW) that identifies a set of "unambiguous" preceding queries in a dynamically determined window and uses them to classify an ambiguous query into a topic. Furthermore, utilizing a relationship-net (R-net) that represents relationships among known topics, we improved the classification effectiveness for those ambiguous queries whose predicted topic is related, in the relationship-net, to the topic of a query within the window. Our results indicated that the hybrid approach (DQW+R-net) yields a statistically significant improvement over the Conditional Random Field (CRF) query classification approach that uses static query windowing and a hierarchical taxonomy (SQW+Tax), in terms of precision (10.8%), recall (13.2%), and F1 measure (11.9%). The findings of this research can improve our understanding of user query intent and, consequently, the search results.
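A rough sketch of how a dynamically sized window and a relationship-net might interact is given below. It is an illustration under stated assumptions, not the DQW+R-net algorithm from the paper: the confidence threshold, the rule for where the window stops, the function names, and the toy data are all hypothetical.

```python
def dynamic_window(session, min_confidence=0.8):
    """Collect unambiguous preceding queries, newest first.

    `session` holds (query, topic, confidence) tuples for the queries that
    precede the current one, oldest first. Assumption for this sketch: a
    query is "unambiguous" when its classifier confidence reaches
    `min_confidence`, and the window ends at the first unambiguous query
    whose topic differs from the most recent unambiguous one.
    """
    window = []
    for query, topic, confidence in reversed(session):
        if confidence < min_confidence:
            continue  # ambiguous query: skip it and keep looking further back
        if window and topic != window[0][1]:
            break  # topic shift: the window stops here
        window.append((query, topic))
    return window


def classify_with_context(candidates, window, related):
    """Pick the best candidate topic that matches, or is related to, a
    topic in the window; fall back to the top-scoring candidate.

    `candidates` is a list of (topic, score) sorted best first; `related`
    is a set of (topic_a, topic_b) pairs standing in for the R-net.
    """
    window_topics = {topic for _, topic in window}
    for topic, _score in candidates:
        if topic in window_topics:
            return topic
        if any((topic, w) in related or (w, topic) in related
               for w in window_topics):
            return topic
    return candidates[0][0]


# Toy session (invented for this example), oldest query first.
session = [
    ("cheap flights", "travel", 0.90),
    ("nba scores", "sports", 0.95),
    ("jaguar", "autos", 0.40),
    ("lakers tickets", "sports", 0.90),
]
window = dynamic_window(session)

# Current ambiguous query "eagles" with its candidate topics.
candidates = [("music", 0.40), ("football", 0.35), ("animals", 0.25)]
related = {("football", "sports")}
topic = classify_with_context(candidates, window, related)
print(window, topic)
```

In this toy run the window skips the ambiguous "jaguar" query and stops at the topic shift to travel, so its size depends on the session rather than on a fixed count.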

N. Goharian, S. Mengle, "Context Aware Query Classification Using Dynamic Query Window and Relationship Net", In proceedings of ACM 33rd Conference on Research and Development in Information Retrieval (SIGIR’10), July 2010.

Spam Detection in Short Text (SMS)
Spam detection has historically focused on email spam. However, with ever-increasing sources of short texts, on the order of tens of characters, such as Twitter and mobile phone text messages, it is important to be able to detect spam even when the text provides so little information. We examined the effect of various text-based features, such as character n-grams, word n-grams, message length, and specific words such as “rate” and “award”, on spam classification. We found that simple textual features such as character n-grams are good indicators. We are interested in extending this work and potentially applying it to other domains.
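The kinds of features named above can be extracted with a few lines of code. The sketch below is illustrative only: the function names, the feature key names, and the default n-gram size are assumptions, and the feature set used in the paper may differ.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def sms_features(text, n=3, cue_words=("rate", "award")):
    """Build a sparse feature dict for one message.

    Hypothetical feature layout: character n-gram counts, the message
    length, and a 0/1 flag per cue word. Cue words are matched against
    whitespace-separated tokens, so "award!" would not match "award".
    """
    features = dict(char_ngrams(text, n))
    features["__length__"] = len(text)
    tokens = text.lower().split()
    for word in cue_words:
        features[f"__word_{word}__"] = int(word in tokens)
    return features


features = sms_features("Claim your award now")
print(features["__length__"], features["__word_award__"], features["awa"])
```

A dict like this can be fed to any classifier that accepts sparse feature mappings.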

Z. Tan, N. Goharian, M. Sherr, "$100,000 Prize Jackpot. Call Now! Identifying the Pertinent Features of SMS Spam", In proceedings of ACM 35th Conference on Research and Development in Information Retrieval (SIGIR'12), August 2012.

Contextual Search
There has been growing interest in contextual (personalized and location-specific) search. We propose a learning-to-rank model that combines general, city-specific, and personalized information. This model is used to produce a personalized and city-specific result set by reranking location-specific results retrieved from the open Web.

A. Yates, D. DeBoer, H. Yang, N. Goharian, S. Kunath, O. Frieder, "(Not Too) Personalized Learning to Rank for Contextual Suggestion", TREC 2012 Contextual Suggestion Track, November 2012.

Personalized Ranking Twitter Friends: Who to follow or not to follow
One of the challenges for users of social media such as Twitter is the fast-growing number of people each user follows. The features available in Twitter provide meaningful information that can be harvested to provide each user with a ranked list of "friends" (i.e., followees). We hypothesize that retweet and mention features can be further enriched by incorporating both temporal information and additional, indirect links from within the user's community.
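As a rough illustration of combining retweet and mention signals with temporal information, the sketch below scores each followee by recency-weighted interactions. Everything in it is an assumption made for illustration (the exponential half-life decay, the equal weights, the function name, and the toy data); it does not cover indirect community links and is not the feature set evaluated in the paper.

```python
from collections import defaultdict


def rank_followees(interactions, now, half_life_days=30.0):
    """Rank followees by recency-weighted retweets and mentions.

    `interactions` is an iterable of (followee, kind, day) where kind is
    "retweet" or "mention" and `day` is a timestamp in days. Each
    interaction contributes weight * 0.5 ** (age / half_life_days).
    """
    kind_weight = {"retweet": 1.0, "mention": 1.0}  # hypothetical weights
    scores = defaultdict(float)
    for followee, kind, day in interactions:
        age = now - day
        scores[followee] += kind_weight[kind] * 0.5 ** (age / half_life_days)
    return sorted(scores, key=scores.get, reverse=True)


# Toy interaction log (invented for this example).
interactions = [
    ("alice", "retweet", 100),
    ("bob", "retweet", 10),
    ("bob", "mention", 20),
    ("carol", "mention", 95),
]
ranking = rank_followees(interactions, now=100)
print(ranking)
```

Here two old interactions with one followee count for less than a single recent interaction with another, which is the effect a temporal feature is meant to capture.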

Y. Zhu and N. Goharian, To Follow or Not to Follow: A Feature Evaluation, 22nd International Conference on World Wide Web (WWW), May 2013 (short).

Domain Specific Search & Mining: Mining Social Media for Healthcare
Online discussions of virtually all topics are increasing, and this is especially true in the domain of healthcare. Individuals are rapidly and steadily posting remarks about their own health and their loved ones' health on a variety of social media. Given these publicly available statements, there is both interest in and potential for harnessing these sources to further our knowledge and understanding of drug behavior. We focus on using several drug-related and general social media sites, query analysis, peer-to-peer, and Web sites to detect expected and unexpected adverse reactions to drugs and devices. To understand users' intentions, we utilize consumer medical terminology from UMLS and various other approaches to generate an adverse reaction synonym set, which we use to identify both expected adverse drug reactions (ADRs), as already recorded by the FDA, and unexpected ADRs mentioned in online reviews. Background (drug) language is utilized to evaluate the strength of the detected unexpected ADRs. Existing synonym discovery methods perform poorly when faced with the realistic task of identifying a target term's synonyms from among many candidates. We approach domain-specific synonym discovery as a graded relevance ranking problem in which a target term's synonym candidates are ranked by their quality. In this scenario, a human editor uses each ranked list of synonym candidates to build a domain-specific thesaurus. We evaluate our method for graded relevance ranking of synonym candidates and find that it outperforms existing methods. Currently, we are enhancing the system.
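To make the graded relevance framing concrete, the sketch below ranks synonym candidates by a model score and evaluates the ranking with normalized discounted cumulative gain (nDCG), a standard metric for graded judgments. Whether the cited papers report nDCG is not stated here, and the candidate terms, scores, and grades are invented for illustration.

```python
import math


def dcg(grades):
    """Discounted cumulative gain with the common 2**grade - 1 gain."""
    return sum((2 ** g - 1) / math.log2(rank + 2)
               for rank, g in enumerate(grades))


def ndcg(ranked_grades):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_grades, reverse=True))
    return dcg(ranked_grades) / ideal if ideal > 0 else 0.0


# Hypothetical candidates for the target term "headache":
# (candidate, model score, editor grade from 0 = unrelated to 3 = synonym).
candidates = [
    ("cephalalgia", 0.91, 3),
    ("head pain", 0.85, 3),
    ("migraine", 0.60, 2),
    ("dizziness", 0.72, 0),
]
ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
score = ndcg([grade for _term, _score, grade in ranked])
print([term for term, _score, _grade in ranked], round(score, 3))
```

Because the gain grows with the grade, placing a true synonym below an unrelated term costs more than swapping two equally graded candidates, which matches the editor-facing scenario described above.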

A. Yates, J. Parker, N. Goharian, and O. Frieder, "A Framework for Public Health Surveillance", In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), May 2014.

A. Yates, N. Goharian, O. Frieder, "Relevance-Ranked Domain-Specific Synonym Discovery", in Proceedings of the 36th European Conference on Information Retrieval (ECIR '14), April 2014.

E. W. Burger, H. Federoff, O. Frieder, N. Goharian, A. Yates, "Social Media Communications Networks and Pharmacovigilance: SequelAE-2.0", IEEE 15th International Conference on e-Health Networking, Applications and Services (Healthcom), Oct 2013 (short).

J. Parker, Y. Wei, A. Yates, N. Goharian, O. Frieder, "A Framework for Detecting Public Health Trends with Twitter", The 2013 IEEE/ACM International Conference on Advances in Social Network Analysis and Mining, Aug. 2013.

A. Yates, N. Goharian, O. Frieder, "Extracting Adverse Drug Reactions from Forum Posts and Linking them to Drugs", SIGIR Workshop on Health Search and Discovery, July-Aug 2013.

A. Yates, N. Goharian, and O. Frieder, Graded Relevance Ranking for Synonym Discovery, 22nd International Conference on World Wide Web (WWW), May 2013 (short).

A. Yates and N. Goharian, ADRTrace: Detecting Expected and Unexpected Adverse Drug Reactions from User Reviews on Social Media Sites, 35th European Conference on Information Retrieval (ECIR), 2013 (short).

A. Yates, N. Goharian, Mining Social Media for Healthcare, ICBI Biomedical Informatics Symposium at Georgetown University, Best Poster Award (1 out of 28), Oct 2012.