Meta-data for two online ideological communities 

For more information, please contact Professor Adam Joinson (A.Joinson@bath.ac.uk) at the University of Bath, UK
Data owner: Adam Joinson

Brief description:
Data were collected from two ideological communities. For Community A, 24 months' worth of metadata is stored here; for Community B, 6 months' worth is stored. Please note: content from the forums is not included in this dataset, nor are the names of any threads or sub forums.

Community A has been split into four 6-month .csv files:
* CommunityA_6_0.csv - denoting time of data collection to minus 6 months, containing 1,631 users
* CommunityA_12_6.csv - denoting minus 6 months to minus 12 months (from time of data collection) containing 1,495 users
* CommunityA_18_12.csv - denoting minus 12 months to minus 18 months (from time of data collection) containing 1,458 users
* CommunityA_24_18.csv - denoting minus 18 months to minus 24 months (from time of data collection) containing 1,293 users

Community B has only one file, from time of data collection to minus 6 months: CommunityB.csv, containing 849 users

Each file contains the following fields:
* id - a unique hash ID (Community B just has the "instance number")
* InDegree - total number of unique network neighbours replying to (or quoting) a user
* OutDegree - total number of unique network neighbours receiving posts from (or being quoted by) a user
* TotalPosts - total number of posts of a user
* MeanWC - mean average word count for all of a user's posts
* ThankRate - mean average number of thanks per post. Calculated as: total number of thanks received / total number of posts made 
* PercQues - percentage of a user's posts that contain question marks (excluding within URLs)
* PercUrls - percentage of a user's posts that contain URLs
* MeanPostsPerThread - total number of posts / number of threads participated in
* InitiationRatio - number of threads initiated / number of threads participated in
* MeanPostsPerSubForum - total number of posts / number of sub forums participated in
* PercBiNeighbours - number of neighbours a user has both received posts from and posted replies to / total number of unique network neighbours 
* cluster - the cluster allocations from Weka (exactly how this was done can be found here: Davidson, Jones, Joinson, and Hinds, (pending revisions), The evolution of online ideological communities, PLOS ONE)
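
As an illustration of how the files might be read, the following is a minimal Python sketch. It assumes each .csv file is comma-delimited and has a header row whose column names match the field list above exactly; neither point is confirmed here. The function name read_users, the period labels, and the sample row (including its cluster label) are invented for illustration only.

```python
import csv
import io

# File names as listed above, mapped to (start, end) in months before data collection.
PERIODS = {
    "CommunityA_6_0.csv": (6, 0),
    "CommunityA_12_6.csv": (12, 6),
    "CommunityA_18_12.csv": (18, 12),
    "CommunityA_24_18.csv": (24, 18),
}

# Columns treated as numeric; "id" and "cluster" are kept as strings.
NUMERIC_FIELDS = [
    "InDegree", "OutDegree", "TotalPosts", "MeanWC", "ThankRate",
    "PercQues", "PercUrls", "MeanPostsPerThread", "InitiationRatio",
    "MeanPostsPerSubForum", "PercBiNeighbours",
]


def read_users(handle, period):
    """Parse one metadata file into a list of dicts, tagging each row with its period."""
    rows = []
    for row in csv.DictReader(handle):
        record = {"id": row["id"], "cluster": row["cluster"], "period": period}
        for name in NUMERIC_FIELDS:
            record[name] = float(row[name])
        rows.append(record)
    return rows


# Invented sample standing in for a real file (assumed header layout).
sample = (
    "id,InDegree,OutDegree,TotalPosts,MeanWC,ThankRate,PercQues,PercUrls,"
    "MeanPostsPerThread,InitiationRatio,MeanPostsPerSubForum,PercBiNeighbours,cluster\n"
    "abc123,3,1,10,42.5,0.2,30.0,10.0,2.5,0.25,5.0,0.33,cluster1\n"
)
rows = read_users(io.StringIO(sample), PERIODS["CommunityA_6_0.csv"])
```

With the real files, each name in PERIODS could be opened with open(name, newline="") and passed to read_users in the same way, and the resulting lists concatenated to compare users across the four periods.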


Data collection method:
Metadata were collected from the online communities using a technique called screen scraping, which is broadly analogous to large-scale automated cut-and-pasting of web pages. Scraping was conducted using a custom Perl/MySQL tool. The tool collects data securely from the Internet using a Privoxy/Tor chain. Where needed for forum access, cookies were supplied with HTTP requests made by the tool, but these were rotated regularly to maintain anonymity. Attempts by forum software to block scraper access based upon known proxy IP addresses were countered automatically by the tool using IP rotation. HTTP page loading errors were handled during scraping, and a small number of retries were attempted where possible to prevent unnecessary loss of data. Data capture errors were primarily detected by validating that the correct number of fields of each type had been identified and extracted from each page by the HTML parser and regular expressions. All validation errors, every URL scraped, cookie and IP rotations, etc. were logged and retained to monitor scraping accuracy.
NO scraping behind logins was conducted, and only publicly available data was collected. The content of the postings was stored in a MySQL database.
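
The retry and field-count validation steps described above can be illustrated with a short Python sketch. This is not the original Perl/MySQL tool: the function names, the retry limit, the choice of OSError as the retryable error, and the shape of the extracted data are all assumptions made for illustration. The page-fetching function is passed in as a parameter so that no particular HTTP library or proxy setup is implied.

```python
import time


def fetch_with_retries(fetch, url, max_retries=3, delay=0.0):
    """Call fetch(url); on OSError, retry up to max_retries more times, then re-raise."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except OSError as error:
            last_error = error
            if attempt < max_retries:
                time.sleep(delay)  # pause before the next attempt
    raise last_error


def validate_field_counts(extracted, expected_posts):
    """Return the names of fields whose number of extracted values
    differs from the number of posts expected on the page."""
    return sorted(
        name for name, values in extracted.items()
        if len(values) != expected_posts
    )


# Hypothetical usage: a fetcher that fails twice before succeeding.
attempts = {"count": 0}


def flaky_fetch(url):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise OSError("simulated page loading error")
    return "<html>page</html>"


page = fetch_with_retries(flaky_fetch, "http://example.org/thread/1")
problems = validate_field_counts(
    {"poster": ["a", "b"], "time": ["2010-01-01 10:00:00"]},
    expected_posts=2,
)
```

In this sketch a page whose validation returns a non-empty list would be logged as a data capture error rather than silently stored.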

All posts from Community A were scraped, which yielded a dataset of 1,494,464 posts from 11,778 users across 32 discrete subforums, covering a timeframe of over 10 years from 2001 to 2011. However, the data deposited here ONLY contains the most recent two years of this data (2009-2011).

All posts from Community B were scraped, which yielded a dataset of 485,299 posts from 3,205 users across 25 different subforums, covering a time period from 2004 to 2011. However, the data deposited here ONLY contains the most recent 6 months of this data.


Data processing and preparations:
Table 1 below displays the metrics collected for individual users during the scraping process. These data allowed us to derive three types of behavioural metrics for use (see Table 2):
1. metrics about individual users (total no. of thread starts, total no. of replies, total no. of times quoted, period of time active within the community, mean and modal post rates, post bursts and timing of posts),
2. metrics about the direct relationships between users that were derived from the data collected (who quotes whom, who replies to whom and who thanks whom),
3. network metrics derived from patterns of the direct relationships.


Table 1
Name, Type, Description
ID, int, auto incremented primary key
time, datetime, timestamp of post
unixtime, int, time of post in unix format
poster, tinytext, poster's username
content, text, message content (minus quote if present)
quote, text, quoted message content if present
quoter, tinytext, username of person quoted
quoted, tinyint, boolean: 1 = quote present, 0 = quote not present
postwordcount, int, WC of content field 
quotewordcount, int, WC of quote field
seniority, text, total number of posts made by poster
reppower, int, reputation power (weighting of ability to affect other's reputation)
reputation, int, reputation score
numthanks, int, number of thanks received for current post
thanks names, text, usernames of those thanking current post, CSV
thanksids, text, userids of those thanking current post, CSV
joindate, datetime, date poster joined (timestamp)
tendency, text, tendency of poster
url, tinytext, permalink to current post
thread, tinytext, VB name of thread
postnum, int, position of post within thread i.e. 1=thread starting post
userid, int, VB userid of poster
subname, text, name of subforum of current post
subcode, text, VB code of subforum of current post
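
To make the link between the scraped post records and the per-user fields concrete, here is a hedged Python sketch that computes several of the ratios from rows shaped like Table 1. It is not the processing code used for the deposited files. The URL pattern, the 0-100 scaling of the percentage fields, the use of postnum = 1 to identify thread starts, and the sample posts are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumption: a simple pattern is enough to spot URLs in post content.
URL_PATTERN = re.compile(r"https?://\S+")


def user_metrics(posts):
    """posts: list of dicts using Table 1 names (poster, content,
    postwordcount, numthanks, thread, postnum, subname)."""
    by_user = defaultdict(list)
    for post in posts:
        by_user[post["poster"]].append(post)

    result = {}
    for user, rows in by_user.items():
        total = len(rows)
        threads = {r["thread"] for r in rows}
        subforums = {r["subname"] for r in rows}
        # Assumption: a thread is "initiated" by the post with postnum == 1.
        initiated = {r["thread"] for r in rows if r["postnum"] == 1}
        with_url = sum(1 for r in rows if URL_PATTERN.search(r["content"]))
        # Question marks are counted only after URLs have been removed.
        with_question = sum(
            1 for r in rows if "?" in URL_PATTERN.sub("", r["content"])
        )
        result[user] = {
            "TotalPosts": total,
            "MeanWC": sum(r["postwordcount"] for r in rows) / total,
            "ThankRate": sum(r["numthanks"] for r in rows) / total,
            "PercQues": 100 * with_question / total,
            "PercUrls": 100 * with_url / total,
            "MeanPostsPerThread": total / len(threads),
            "InitiationRatio": len(initiated) / len(threads),
            "MeanPostsPerSubForum": total / len(subforums),
        }
    return result


# Invented sample posts for a single hypothetical user.
posts = [
    {"poster": "alice", "content": "Is this right?", "postwordcount": 3,
     "numthanks": 2, "thread": "t1", "postnum": 1, "subname": "s1"},
    {"poster": "alice", "content": "See http://example.org/page?id=1",
     "postwordcount": 2, "numthanks": 0, "thread": "t1", "postnum": 3,
     "subname": "s1"},
    {"poster": "alice", "content": "Agreed.", "postwordcount": 1,
     "numthanks": 1, "thread": "t2", "postnum": 2, "subname": "s2"},
    {"poster": "alice", "content": "Thanks all", "postwordcount": 2,
     "numthanks": 1, "thread": "t2", "postnum": 5, "subname": "s2"},
]
metrics = user_metrics(posts)
```

In the sample, the second post contains a question mark only inside its URL, so it counts towards PercUrls but not towards PercQues.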



Table 2:

Structural Features
-In-Degree - Total number of unique network neighbours replying to (or quoting) a user, e.g. Figure 2: User A In-Degree = 3
-Out-Degree - Total number of unique network neighbours receiving posts from (or being quoted by) a user, e.g. Figure 2: User A Out-Degree = 1

Content Features
-Word Count - Mean average word count for all of a user's posts
-Percentage question marks - Percentage of a user's posts that contain question marks (excluding within URLs)
-Percentage URLs - Percentage of a user's posts that contain URLs

Popularity Features 
-Thank Rate - Mean average number of thanks per post.  Calculated as: Total Number of Thanks Received / Total Number of Posts Made

Initiation Features
-Initiation Ratio - Number of threads initiated / Number of threads participated in

Diversity Features
-No. Threads - Total number of threads participated in
-No. Sub Forums - Total number of sub-forums participated in

Persistence Features
-Posts Per Sub Forum - Total number of posts / Number of sub forums participated in
-Posts Per Thread - Total Number of posts / number of threads participated in

Reciprocity Features
-Percentage Bi-directional Neighbours - Number of neighbours that a user has both received posts from and posted replies to / Total number of unique network neighbours
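
The structural and reciprocity features above can be sketched in Python from a list of directed reply/quote relationships. This is an illustrative reading of the definitions, not the code used to produce the deposited values. The edge representation (source replied to or quoted target), the exclusion of self-replies, the decision to return the bi-directional measure as a fraction rather than a 0-100 percentage, and the sample edges are all assumptions.

```python
from collections import defaultdict


def network_metrics(edges):
    """edges: iterable of (source, target) pairs, meaning that
    source replied to (or quoted) target."""
    incoming = defaultdict(set)   # user -> neighbours who posted to them
    outgoing = defaultdict(set)   # user -> neighbours they posted to
    for source, target in edges:
        if source == target:
            continue  # assumption: self-replies do not create a neighbour
        outgoing[source].add(target)
        incoming[target].add(source)

    metrics = {}
    for user in set(incoming) | set(outgoing):
        neighbours = incoming[user] | outgoing[user]
        both_ways = incoming[user] & outgoing[user]
        metrics[user] = {
            "InDegree": len(incoming[user]),
            "OutDegree": len(outgoing[user]),
            # Fraction of unique neighbours with traffic in both directions.
            "PercBiNeighbours": len(both_ways) / len(neighbours),
        }
    return metrics


# Hypothetical edges: B, C and D each post to A; A replies only to B.
edges = [("B", "A"), ("C", "A"), ("D", "A"), ("A", "B"), ("B", "A")]
result = network_metrics(edges)
```

Because neighbours are stored in sets, the repeated ("B", "A") edge does not inflate the degree counts: in this sample user A has an in-degree of 3, an out-degree of 1, and one bi-directional neighbour out of three.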

