Self-Reported Mental Health Diagnoses (SMHD) dataset

The SMHD (Self-Reported Mental Health Diagnoses) dataset consists of Reddit posts by users who claimed to have been diagnosed with one or more of nine mental health conditions ("diagnosed users"), together with posts by matched control users. All posts made to mental health-related subreddits, or containing keywords related to a mental health condition, were removed from the diagnosed users' data. The control users' data contain no such posts as a result of the selection process.
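
As a rough illustration of the post-removal step described above, the following Python sketch drops any post that was made to a listed subreddit or that mentions a listed keyword. The `Post` type, the function names, and both lists are hypothetical placeholders chosen for this example; they are not the lists or code used to build SMHD, and the paper should be consulted for the actual procedure.

```python
from dataclasses import dataclass

# Hypothetical placeholder lists for illustration only.
# The real subreddit and keyword lists used for SMHD are not reproduced here.
MENTAL_HEALTH_SUBREDDITS = {"depression", "anxiety"}
MENTAL_HEALTH_KEYWORDS = {"diagnosed", "depression", "anxiety"}


@dataclass
class Post:
    """Minimal stand-in for a Reddit post (assumed structure)."""
    subreddit: str
    text: str


def is_mental_health_related(post: Post) -> bool:
    """Return True if the post is in a listed subreddit or mentions a listed keyword."""
    if post.subreddit.lower() in MENTAL_HEALTH_SUBREDDITS:
        return True
    text = post.text.lower()
    # Simple substring matching; a real pipeline may match more carefully.
    return any(keyword in text for keyword in MENTAL_HEALTH_KEYWORDS)


def filter_posts(posts: list[Post]) -> list[Post]:
    """Keep only posts that are not mental health-related."""
    return [post for post in posts if not is_mental_health_related(post)]
```

Under these assumptions, a post in a listed subreddit is removed regardless of its text, and a post elsewhere is removed only if its text contains one of the keywords.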

SMHD covers nine mental health conditions. The table below lists user and post counts for each condition and for the control users.

Condition      Total Users  Total Posts  Users (train)  Users (dev)  Users (test)  Users (RC)
ADHD           10,098       872K         1,768          1,747        1,779         4,804
Anxiety        8,783        795K         1,711          1,593        1,675         3,804
Autism         2,911        248K         479            480          517           1,435
Bipolar        6,434        575K         1,216          1,182        1,247         2,789
Depression     14,139       1,272K       2,662          2,574        2,611         6,292
Eating         598          53K          104            115          112           267
OCD            2,336        203K         409            477          390           1,060
PTSD           2,894        258K         528            516          558           1,292
Schizophrenia  1,331        123K         238            278          267           548
Control        335,952      116M         92,725         92,421       94,415        56,391
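
The four "Users (...)" columns in the table above add up to the "Total Users" column in every row. A minimal Python sketch of that check, with the user counts copied from the table, is shown here. The names `USERS` and `split_mismatches` are introduced only for this example and are not part of any SMHD tooling.

```python
# User counts copied from the table: (total, train, dev, test, RC).
USERS = {
    "ADHD": (10098, 1768, 1747, 1779, 4804),
    "Anxiety": (8783, 1711, 1593, 1675, 3804),
    "Autism": (2911, 479, 480, 517, 1435),
    "Bipolar": (6434, 1216, 1182, 1247, 2789),
    "Depression": (14139, 2662, 2574, 2611, 6292),
    "Eating": (598, 104, 115, 112, 267),
    "OCD": (2336, 409, 477, 390, 1060),
    "PTSD": (2894, 528, 516, 558, 1292),
    "Schizophrenia": (1331, 238, 278, 267, 548),
    "Control": (335952, 92725, 92421, 94415, 56391),
}


def split_mismatches(table):
    """Return the rows whose train/dev/test/RC counts do not sum to the total."""
    return [
        name
        for name, (total, *splits) in table.items()
        if sum(splits) != total
    ]


print(split_mismatches(USERS))  # prints [] (every row is consistent)
```

The same dictionary can be reused to look up split sizes, for example `USERS["Depression"][1]` for the number of Depression users in the train column.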

Further details on dataset construction are available in Section 3 of the COLING 2018 paper "SMHD: A Large-Scale Resource for Exploring Online Language Usage for Multiple Mental Health Conditions".

Information on obtaining this dataset can be found here.


Citation


  @InProceedings{cohan2018smhd,
    author    = {Cohan, Arman and Desmet, Bart and Yates, Andrew and Soldaini, Luca and MacAvaney, Sean and Goharian, Nazli},
    title     = {SMHD: A Large-Scale Resource for Exploring Online Language Usage for Multiple Mental Health Conditions},
    booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING)},
    year      = {2018},
    publisher = {Association for Computational Linguistics},
    pages     = {1485--1497},
    url       = {https://www.aclweb.org/anthology/C18-1126}
  }

Contact Information

For any comments or questions, please email Arman, Bart or Andrew.