Preliminary ethical guidelines for research involving extraction of public tweets for network visualisation.
It seems an open question whether reasonable ethical boundaries ought to prohibit this research altogether, based as it is on collection of data without explicit consent.
One way to frame this question is by considering whether tweets ought to be regarded as published material (I didn’t seek consent to analyse dozens of fictional texts for PhD research, or to use published academic writing) or something more akin to private conversation. Social media posts could be taken as lying somewhere between these two.
The issue of consent and risk to those whose content is incorporated into network visualisations subsequently made public is not the sole ethical consideration for the project, but it is the most substantial.
1. constraints on data collection
collection of tweets only on the basis of what's returned in searches for terms or sets of terms, using Twitter's public search syntax (that is, the tweets collected are the same as would be listed by typing the terms into the search box in a Twitter app or web page). No bulk scraping of tweets by user account, and no extraction of follower lists, favourites, etc.
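As a sketch of the "search terms only" constraint, a query could be assembled from sets of terms using Twitter's public search syntax (quoted phrases, OR between alternatives). The function and its parameters are illustrative assumptions, not the project's actual tooling:

```python
def build_search_query(term_sets):
    """Combine sets of terms into one query in Twitter's public search
    syntax: terms within a set are space-separated (implicit AND),
    sets are joined with OR, and multi-word terms are quoted.

    Hypothetical helper for illustration only.
    """
    clauses = []
    for terms in term_sets:
        quoted = ['"%s"' % t if " " in t else t for t in terms]
        clauses.append(" ".join(quoted))
    if len(clauses) == 1:
        return clauses[0]
    return " OR ".join("(%s)" % c for c in clauses)
```

Because the query is the same string a user could type into the search box, the collected sample is in principle reproducible by anyone.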
2. constraints on data retention
retention of the minimum amount of data, on the principle that data related to posts should be retained, data related to accounts to a lesser extent, and data directly identifying users not at all – no geographical or geolocation data; screen names collected, but real names not. If an account-holder makes their account private, or closes their account, that act would also cut links (at least overt links) to the data set.
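The retention rule above could be applied at ingestion, as in this minimal sketch; it assumes tweets arrive as dicts roughly in the shape of Twitter's JSON, and the exact field names here are illustrative:

```python
# Fields retained under the minimisation principle: post content and
# the screen name, but no real name and no geographic data.
RETAINED_POST_FIELDS = {"id", "text", "created_at", "lang"}
RETAINED_USER_FIELDS = {"screen_name"}  # "name" deliberately excluded

def minimise(tweet):
    """Return a copy of a raw tweet dict stripped to the fields the
    retention policy allows. Geo/place fields are never copied, so
    they cannot leak into later processing or backups."""
    kept = {k: v for k, v in tweet.items() if k in RETAINED_POST_FIELDS}
    user = tweet.get("user", {})
    kept["user"] = {k: v for k, v in user.items() if k in RETAINED_USER_FIELDS}
    return kept
```

Filtering before storage, rather than after, means the excluded fields never exist on disk in the first place.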
3. security and integrity of data
raw and processed data retained and backed up on matched, encrypted drives or volumes.
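One way to check that the matched copies have not diverged is to compare checksums of every file, sketched here with the standard library (the directory layout and pairing scheme are assumptions):

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Hash a file in chunks so large archives aren't read into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def backups_match(primary_dir, backup_dir):
    """True if every file under primary_dir has an identical twin
    (same relative path, same SHA-256 digest) under backup_dir."""
    primary = Path(primary_dir)
    backup = Path(backup_dir)
    for f in primary.rglob("*"):
        if f.is_file():
            twin = backup / f.relative_to(primary)
            if not twin.is_file() or sha256_of(f) != sha256_of(twin):
                return False
    return True
```

A check like this verifies integrity of the copies; confidentiality still depends on the drives or volumes themselves being encrypted.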
4. apprehension of risk in context
case-by-case consideration (in selecting topics and devising search terms) of potential risk to agents posting content, however unlikely, taking account of power relations within and beyond the network mapped. Consult others if ill-equipped to judge. Possible responses to risk include:
not collecting data
not publishing representations of the data
excluding screen-name or text labels on nodes, making accounts more anonymous.
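The last option above amounts to relabelling before export: replacing screen names with opaque identifiers in the published graph while keeping the mapping private. A minimal sketch, with hypothetical function names:

```python
def anonymise_graph(edges):
    """Relabel an edge list of (screen_name, screen_name) pairs with
    opaque node ids ("n1", "n2", ...). Returns (anon_edges, mapping).
    Only anon_edges would be published; the mapping stays private so
    analysis remains possible."""
    mapping = {}

    def nid(name):
        # Assign ids in order of first appearance, reusing an id if
        # the same account recurs across edges.
        if name not in mapping:
            mapping[name] = "n%d" % (len(mapping) + 1)
        return mapping[name]

    anon = [(nid(a), nid(b)) for a, b in edges]
    return anon, mapping
```

Note that relabelling reduces, but does not eliminate, re-identification risk: distinctive network structure or quoted text can still point back to an account.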
5. misrepresentation, overstatement, understatement
all care taken to avoid any manipulation of the data that might lead to a distorted perception of the argument drawn from representations of the data or from analysis of it.
as an adjunct to the previous condition, an undertaking to be transparent and garrulous enough about sample and method to give a fair sense of what’s being represented in visualisations.