A brief description of the method followed in producing code and templates for extraction and network visualisation of Twitter content.
The purpose is development of a tool for ‘argument mapping’ – rhetorical or pragmatic analysis of online discourse with the aid of visual representations of raw data.
Visualisations ought to have both exploratory and explanatory utility.
The scripts I’ve written in R are designed to serve up this data in different formats fit for visualising different aspects of networks and argument. One kind of visualisation takes messages (tweets) as nodes, points on the map. This captures a discussion in such a way that a series of replies to an initial tweet, for example, can be followed. Another kind of visualisation takes agents or accounts as nodes.
The method described on this page is focused on this second kind of visualisation. The result is a map of Twitter communities participating in a discourse on a subject, in which communities are made up of coloured nodes that each represent a Twitter account, identified by a handle. For the purposes of the research, each account is a ‘corpus of tweets’: this research isn’t focused on bot discovery or otherwise identifying who is behind an account.
See also: relations network, semantic co-occurrence network, tweet network
Mapping a network of communities
Search and data collection
An R package (rtweet) is used to extract data from Twitter’s REST API. Each search is based on a text search string of up to 500 characters. Searches are developed using Twitter’s regular search function, which yields exactly the same results as the API.
Though there are other approaches available for scraping data (it’s possible to just scrape the last 3,200 tweets from any account individually, and build up a data-set that way), using only search strings based on terms is one way of ensuring that content and communities remain the focus of the research, rather than individual authors of content (there’s a summary of the ethical limits I’ve set myself here).
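To give a sense of this step, here’s a minimal sketch using rtweet. The query, tweet count, and file name are illustrative rather than those used in the research, and the arguments assume a pre-1.0 version of the package:

```r
library(rtweet)

# collect recent tweets matching a search string of up to 500 characters
# (query and n are illustrative)
tweets <- search_tweets(
  q = "socialism",
  n = 18000,
  include_rts = TRUE,       # keep retweets, needed for retweet edges
  retryonratelimit = TRUE   # wait out rate limits on larger collections
)

# save the raw collection before any preparation
saveRDS(tweets, "tweets_socialism.rds")
```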
Preparation of data
Data is collected and saved, and purpose-written R scripts are then used to prepare csv files that the data visualisation software Gephi can use to build maps of social media networks, maps of tweets, or maps of frequently used terms. Each kind of visualisation requires its own script to prepare the data for that mode, so developing a new way of looking at the data usually means writing a new script that outputs the data-set differently.
Two lists are required for a visualisation. The first lists the nodes in the network (tweets or Twitter accounts). The second separately lists ‘edges’, or connections between nodes (replies, retweets, quote-tweets, and mentions). Each edge, each connection, has a source node and a target node.
Gephi constructs a visualisation of the data from these complementary lists.
If an account tweets using the terms in the search string, the data is collected, and the account handle appears as a node on the map.
If the tweet is a reply, a retweet, or a quote-tweet, then there is a connection between two tweets, and between two accounts. The account that replied, retweeted, or quote-tweeted is the ‘source’ for an ‘edge’ – a connection.
The account that issued the tweet that is replied to, retweeted, etc., is recorded as the ‘target’ for the edge.
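A minimal sketch of that preparation, covering only reply and mention edges, and assuming the column names of an older (pre-1.0) rtweet data frame – newer versions of the package name these fields differently:

```r
library(dplyr)
library(tidyr)

# reply edges: one row per tweet that replies to another account
replies <- tweets %>%
  filter(!is.na(reply_to_screen_name)) %>%
  transmute(Source = screen_name,
            Target = reply_to_screen_name,
            Type   = "Directed",
            Kind   = "reply")

# mention edges: mentions_screen_name is a list-column, so unnest it
mentions <- tweets %>%
  select(screen_name, mentions_screen_name) %>%
  unnest(mentions_screen_name) %>%
  filter(!is.na(mentions_screen_name)) %>%
  transmute(Source = screen_name,
            Target = mentions_screen_name,
            Type   = "Directed",
            Kind   = "mention")

edges <- bind_rows(replies, mentions)

# nodes: every handle appearing as a source or a target of an edge
nodes <- tibble::tibble(Id = unique(c(edges$Source, edges$Target)))
nodes$Label <- nodes$Id

# Gephi can read these two csv files directly
write.csv(nodes, "nodes.csv", row.names = FALSE)
write.csv(edges, "edges.csv", row.names = FALSE)
```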
To create a network map, the nodes list and the edge list are saved as csv files and then opened in Gephi, where an entry in the edge list looks like this:
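A representative pair of entries, with invented handles, might be:

```
Source,Target,Type,Kind
example_account_a,example_account_b,Directed,reply
example_account_c,example_account_a,Directed,retweet
```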
Data visualisation
The preparation and output of maps by the method I’ve developed isn’t an automated process: Gephi’s algorithms shape the data, after which it’s processed for legibility. For a sense of this, the following image shows the initial visualisation Gephi provides for a data-set constituted of a node list of 17,500 Twitter accounts, with nearly 25,000 connections between them in total.
Running a graphing algorithm (in this case Gephi’s native algorithm, ForceAtlas 2) quickly resolves the graph into clusters.
This part of the process doesn’t get old: when an algorithm runs and clusters emerge from the data based on inter-connections, prominent and very active nodes are exposed, and a hundred thousand tweets resolve into a form in which a bird’s-eye view is possible.
From an exploratory point of view, the data’s now where it ought to be – in an environment in which it can be manipulated by algorithm, drilled down into by zooming in, and where the content of tweets can be read by clicking on nodes or switching back to the database.
From this point the map can be manipulated using filters, coloured, adjusted and reshaped. Other visualisations of the same data-set can be produced to observe different aspects of the discourse under analysis. This aids exploration of the data-set, but also the development of network visualisations with explanatory power for output.
Map of tweets from 18,000 accounts, with 25,000 connections, from a search based on the single word ‘socialism’, over a period of a few hours on Jan 5th 2019.
Some limitations of visualisations and workarounds
The most extensive network of tweets I’ve attempted to map thus far included a quarter of a million tweets from 79,000 accounts with half a million connections of one of the four types (reply, retweet, quote, mention) between nodes. All of the tweets were made during a two day period in January, and all referenced one person – a British Tory MP, Anna Soubry.
Mapping a network on this scale seems at the upper end of what either the graphing software or the laptop I’m using can handle – experiment is needed to work out which. Another limitation of a data set of this size concerns its presentation – to show all 10,000 accounts with tweets collected in this search on socialism legibly on a screen would clearly require a screen of more than 10,000 pixels on a side. A zoomable vector image is a possibility, but that requires a degree of interactivity, so the map can’t be taken in on quick inspection.
The usual workaround in creating a representation of the entire data set is to hide nodes with fewer edges, leaving more prominent accounts visible. That is, in order even to appear in the graph, an account must refer to or be referred to by other accounts a number of times, not just once. The number of edges a node has is its ‘degree’. I’ve used a ‘degree filter’ of 2, of 10, and of 25 in making representations of argument – it’s obviously a feature about which transparency is needed for better comprehension of the end result, so I usually post the degree filter used along with the graph.
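The filtering itself happens inside Gephi, but the same operation can be sketched in R with the igraph package (an illustration, not part of the workflow described here), using the edges table from the earlier sketch:

```r
library(igraph)

# build a directed graph from the Source/Target edge list
g <- graph_from_data_frame(edges, directed = TRUE)

# mirror a Gephi degree filter of 2: keep only nodes with at least 2 edges
g_filtered <- induced_subgraph(g, V(g)[degree(g) >= 2])
```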
Another approach to negotiating the scale of large data sets in representation is via details of the larger map that pick out parts of the network.
US Senator Bernie Sanders and Representative Liz Cheney had an exchange on Twitter that day about socialism, though without using the Twitter ‘reply’ function, after which many tweeted both of them at once. The small group on the right appears in proximity to Sanders and Cheney because one of them replied to Sanders; the rest is these accounts communing in furious agreement with one another about the horrors of socialism.
This form of sub-layout doesn’t require the alteration of the graph per se, but others that do can give further insight into the data under analysis. The next image is essentially the same graph – based on the same data table, same degree filter (2), and without re-running the algorithm from scratch. Just one change has been made – here the node size and label size are determined by ‘out-degree’ rather than ‘in-degree’.
The difference between out-degree and in-degree is just the difference between being the source of tweets referencing other accounts and being the target of those tweets. Each edge a node has counts as one degree, but in this graph edges have a direction, out from a node or in to a node. A node, or account, with a high ‘in-degree’ count, as in the graphs above, is one that many other accounts are referencing. An account with a high ‘out-degree’ count, as the largest nodes below have, is one with a high intensity of tweets both using the term searched and mentioning, replying to, retweeting, etc., other accounts.
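To make the distinction concrete, here’s how the two measures could be computed in R with igraph – again an illustration, reusing the edges table from the earlier sketch:

```r
library(igraph)

g <- graph_from_data_frame(edges, directed = TRUE)

in_deg  <- degree(g, mode = "in")   # how often a handle is referenced by others
out_deg <- degree(g, mode = "out")  # how often a handle references other accounts

head(sort(in_deg,  decreasing = TRUE))   # most prominent: targets of attention
head(sort(out_deg, decreasing = TRUE))   # most active: sources of referencing tweets
```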
There are various reasons why an account might rank highly in out-degree – one is someone using the account for concerted campaigning; another is extended conversation between agents using a group of accounts.
This, then, is the out-degree sub-layout for the agent-network-focused layout. Changing one feature radically changes the graph and its import: rather than the most prominent nodes featuring, measured by the attention of others, the most active nodes are laid bare. There are as many different sub-layouts of this type as there are ways of attributing different weights to the nodes based on the set of edges they and their neighbours are connected by.