Resources
- Health Surveillance Framework – A framework for monitoring public health. [code]
A Framework for Public Health Surveillance
Andrew Yates, Jon Parker, Nazli Goharian, and Ophir Frieder
Language Resources and Evaluation (LREC 2014)
- QuickUMLS – An unsupervised biomedical concept extraction tool. [info, code]
QuickUMLS: a fast, unsupervised approach for medical concept extraction
Luca Soldaini and Nazli Goharian
Medical Information Retrieval (MedIR) workshop at SIGIR, July 2016
- Tree-LSTMs for Scientific Relation Classification – Code for extracting semantic relations from scientific text. [code]
Tree-LSTMs for Scientific Relation Classification.
Sean MacAvaney, Luca Soldaini, Arman Cohan, and Nazli Goharian
International Workshop on Semantic Evaluation (SemEval 2018)
- A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents [code]
A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, Nazli Goharian
2018 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2018)
- Neural Ranking Weak Supervision – Train neural rankers using weak relevance pairs. [code]
Content-Based Weak Supervision for Ad-Hoc Re-Ranking.
Sean MacAvaney, Andrew Yates, Kai Hui, and Ophir Frieder
ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019)
- CEDR – Using contextualized embeddings for document ranking. [code]
CEDR: Contextualized Embeddings for Document Ranking.
Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian
ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019)
- OpenNIR – An end-to-end neural ad-hoc ranking pipeline. [info, code]
OpenNIR: A Complete Neural Ad-Hoc Ranking Pipeline
Sean MacAvaney
Web Search and Data Mining (WSDM 2020, demo)
- ir_datasets – A simple Python and command-line interface to several IR datasets. [code]
- OCR Correction – Experimental work to increase the accuracy of OCR by post-processing documents. [code]
- Storing Tweets – Code for storing tweets from the Twitter streaming API. [code, filtered by tweets from North America]
- TBD3 – A Thresholding-Based Dynamic Depression Detection from Social Media for Low-Resource Users [code]
TBD3: A Thresholding-Based Dynamic Depression Detection from Social Media for Low-Resource Users
Hrishikesh Kulkarni, Sean MacAvaney, Nazli Goharian, Ophir Frieder
Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022)
- On Generating Extended Summaries of Long Documents [code]
On Generating Extended Summaries of Long Documents
Sajad Sotudeh, Arman Cohan, and Nazli Goharian
AAAI-21 Workshop on Scientific Document Understanding (SDU 2021)
- TSTR – Too Short to Represent, Summarize with Details! Intro-guided Extended Summary Generation [code]
TSTR: Too Short to Represent, Summarize with Details! Intro-guided Extended Summary Generation
Sajad Sotudeh and Nazli Goharian
2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2022)
Datasets
- Reddit Self-reported Depression Diagnosis (RSDD) – Posts from thousands of Reddit users who claim to have been diagnosed with depression, and carefully selected control users.
- RSDD-Time – Temporally annotated subset of diagnosed RSDD users with information such as how long ago the diagnosis was made and whether the condition persists.
- Self-reported Mental Health Diagnoses (SMHD) – Reddit posts from thousands of users who identify as having one or more of 9 mental health conditions, and carefully selected control users.
Obtaining datasets
RSDD, RSDD-Time, and SMHD contain only publicly available Reddit posts. However, posts may contain information related to users' health and are thus sensitive. To protect users' privacy, researchers who wish to obtain the datasets must sign a data usage agreement. A single agreement may grant access to any or all of the datasets.
Succinctly, the agreement requires that researchers
- make no attempt to contact any user in the dataset
- make no attempt to deanonymize or learn the identity of any user in the dataset
- make no attempt to link users in the dataset with any external information (e.g., an account on another website)
- do not share any portion of the data, including example posts or excerpts from posts, with any other party
Researchers interested in obtaining the datasets may submit a data request form to be provided with the data usage agreement and further information on obtaining the data.