Phase 1) Media article dataset collection

The social media news articles for this study will be sampled from the NewsWhip Insights™ platform. The NewsWhip Insights dataset is built by a privately operated social media crawler that has been collecting data since the beginning of 2014. The platform identifies media stories and tracks how those stories are distributed on social media platforms. The stories are then categorized internally into non-exclusive story “types,” including a category for news. Tracking for each story continues for one month after it is published and includes statistics on how many times it is shared on social media networks (e.g. Facebook, Twitter, Reddit, and LinkedIn). The dataset can be queried using titles, time periods, and content to filter results.

We will query the Insights dataset to generate a list of health news articles that pertain to new research studies and were published in 2015 (from January 1, 2015 0:00:00 EST to December 31, 2015 23:59:59 EST). The search terms for this query are shown below:
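The publication window can be made explicit in code. The following is a minimal sketch, assuming that "EST" means a fixed UTC-5 offset all year (no daylight saving adjustment) and that the window ends at the last second of December 31, 2015 (23:59:59). The names `WINDOW_START`, `WINDOW_END`, and `in_study_window` are illustrative and are not part of the NewsWhip platform.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EST is treated as a fixed UTC-5 offset for the whole year.
EST = timezone(timedelta(hours=-5), "EST")

# Assumption: the window closes at the last second of December 31, 2015.
WINDOW_START = datetime(2015, 1, 1, 0, 0, 0, tzinfo=EST)
WINDOW_END = datetime(2015, 12, 31, 23, 59, 59, tzinfo=EST)


def in_study_window(published_at):
    """Return True if a timezone-aware publication time falls inside the window."""
    return WINDOW_START <= published_at <= WINDOW_END
```

Because the comparison uses timezone-aware datetimes, a publication time recorded in UTC is compared correctly without manual conversion.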

(categories:2) AND ((headline_en:"health" OR summary_en:"health") AND (headline_en:"study" OR summary_en:"study") OR (headline_en:"research" OR summary_en:"research"))

where categories:2 corresponds to NewsWhip’s internally curated category for sites containing news, headline_en is the programmatically extracted headline of the story, and summary_en is the programmatically generated English-language summary of the story’s content.
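To reduce transcription errors, the query string can be assembled from its parts rather than typed by hand. This is a minimal sketch that only builds the text of the query above; it does not call any NewsWhip interface, and the helper names `field_match` and `build_query` are hypothetical. The grouping of the AND and OR terms is reproduced exactly as written, so how the query engine resolves their precedence is left to the platform.

```python
def field_match(term, fields=("headline_en", "summary_en")):
    """Build a clause that matches a term in either the headline or the summary."""
    clauses = " OR ".join(f'{field}:"{term}"' for field in fields)
    return f"({clauses})"


def build_query(category=2):
    """Assemble the full query string, keeping the grouping used in the protocol."""
    health = field_match("health")
    study = field_match("study")
    research = field_match("research")
    return f"(categories:{category}) AND ({health} AND {study} OR {research})"


QUERY = build_query()
print(QUERY)
```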

We define “popularity” as the number of times a story has been shared on a given social network. The two social networks chosen for this study are Twitter and Facebook. We will generate two lists, one for the 1,000 most shared stories on Facebook and one for the 1,000 most shared stories on Twitter. Each media story in each list will be assigned a numerical identifier.
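The list-generation step can be sketched as follows. The sketch assumes each story is a dictionary with hypothetical keys `url`, `facebook`, and `twitter` holding share counts; the actual export format is not specified here. It also assumes that identifiers are assigned sequentially by rank within each list, which is one possible scheme rather than a confirmed one.

```python
def top_shared(stories, network, n=1000):
    """Return the n most shared stories on one network, each with a numerical identifier.

    Assumption: identifiers are sequential within the list, starting at 1 for the
    most shared story. Ties keep their original input order (the sort is stable).
    """
    ranked = sorted(stories, key=lambda story: story[network], reverse=True)[:n]
    return [{**story, "list_id": i} for i, story in enumerate(ranked, start=1)]


# Illustrative records only, not real data.
example_stories = [
    {"url": "story-a", "facebook": 10, "twitter": 3},
    {"url": "story-b", "facebook": 5, "twitter": 9},
    {"url": "story-c", "facebook": 7, "twitter": 1},
]

facebook_list = top_shared(example_stories, "facebook", n=2)
twitter_list = top_shared(example_stories, "twitter", n=2)
```

Under this scheme a story that appears in both lists receives a separate identifier in each list.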
