The following is a description of the method followed in producing code and templates for extraction and network visualisation of social media content. The purpose is development of a set of tools for ‘argument mapping’ – broadly, rhetorical or pragmatic analysis of online discourse with the aid of tables and visual representations of raw data. These instruments should facilitate following particular arguments, helping to identify and observe agency and tactics, and also in identifying patterns in argument in general: its forms.
layout one – agent network
Initial experiments in mapping argument via network visualisation (in the open source software Gephi) indicated that a graph could be used to show networks of agents in relation, or kinds of relation between networks, or focus on the semantics of language used in argument. All of these are of utility, to a degree partly dependent on the research object, and partly on what’s notable or representative about a particular data set. But no one visualisation can adequately represent all of these aspects of a network – or at least not when mapping tens of thousands of agents, posts and connections.
As a consequence, the programme is to develop a number of layouts, effectively different instruments, each with its own minimally variant process, which could be applied to the same data. In some cases one or another layout might be applied. Ideally, though, a set of visualisations using several layouts yields a multi-faceted and so more informative representation of the discourse under analysis.
The code that extracts social media content for this argument-mapping project draws on an R package (rtweet) designed for the purpose. Content is scraped and written to a table of posts and associated data. A second R package is used to prepare tables that list nodes in the network and, separately, ‘edges’, the relations between nodes. These tables can be exported as CSV files and opened in Gephi.
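The table-preparation step can be sketched as follows – in Python rather than the R the project uses, and with invented field names, so this is an illustration of the shape of the data rather than the actual code:

```python
# A scraped post, flattened into one row of the raw table.
# Field names ("id", "author", "reply_to") are invented for illustration.
posts = [
    {"id": "1", "author": "@alice", "text": "socialism works", "reply_to": None},
    {"id": "2", "author": "@bob", "text": "@alice no it doesn't", "reply_to": "@alice"},
]

# Node list: one entry per account appearing in the data.
nodes = sorted({p["author"] for p in posts} |
               {p["reply_to"] for p in posts if p["reply_to"]})

# Edge list: a (source, target) pair for each reply relation.
edges = [(p["author"], p["reply_to"]) for p in posts if p["reply_to"]]

print(nodes)   # ['@alice', '@bob']
print(edges)   # [('@bob', '@alice')]
```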
The preparation and output of graphs is far from an automated process, once the graphing software is opened – the representations are ‘hand-crafted’, the data teased into shape for legibility. Each individual visualisation must be attended to in this way. For a sense of this, the following image shows the initial visualisation Gephi provides.
Running a graphing algorithm (in this case Gephi’s native algorithm) quickly resolves the graph into clusters.
This part of the process doesn’t get old: when an algorithm runs and clusters emerge from the data based on inter-connexions, prominent and active nodes are exposed…a hundred thousand tweets resolve into a form in which it’s possible to make sense of them all at once.
This initial treatment of the data in Gephi is roughly the same each time. What distinguishes the layouts is the way the data is set out.
The adjunct code I’ve written in R is designed to serve up this data in different formats fit for different modes of analysis, for visualising different aspects of networks and argument.
One layout takes agents or accounts as source and target of messages – a series of messages between them strengthens the connection. Another layout takes messages (tweets) as source and target. This alternate layout better captures a series of replies to tweets, or a thread – a string of tweets made when one agent ‘replies’ to a message of their own.
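The difference between the two layouts comes down to what the edge connects. A minimal sketch, in Python rather than the project’s R, and with invented field names:

```python
# One reply, flattened into a row; field names are illustrative.
reply = {"tweet_id": "t2", "author": "@bob",
         "in_reply_to_tweet": "t1", "in_reply_to_account": "@alice"}

# Layout 1: accounts are nodes, so the edge links the two handles.
agent_edge = (reply["author"], reply["in_reply_to_account"])

# Layout 2: tweets are nodes, so the edge links the two tweet ids,
# which preserves the chain structure of replies and threads.
tweet_edge = (reply["tweet_id"], reply["in_reply_to_tweet"])

print(agent_edge)  # ('@bob', '@alice')
print(tweet_edge)  # ('t2', 't1')
```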
The result to date after pointing the script at the raw data and letting it run is a set of a dozen files rather than one. Some of these can be immediately imported into Gephi, others are designed for intermediate processes such as quantitative semantic analysis and qualitative coding. The different layouts below are the product of differing composition of the data for a different emphasis in each case.
The first layout, as pictured below, is focused on engaged accounts. Tweets, though, are collected on the basis of search terms in the content of tweets, not account by account. (It’s possible to scrape just the last 3,200 tweets from any account individually and build up a data set that way; there’s a summary of the ethical limits I’ve set myself here.)
Data is processed in such a way that each node is a Twitter account, its label a screen-name, or handle.
If there is a tweet from the account, or a reply, or a retweet or quote-tweet, then the account that sent the tweet is recorded as the ‘source’ for an ‘edge’ – a connection.
If the tweet is a reply, a retweet, or a quote-tweet, or includes the name of another Twitter account, then the account referred to is recorded as the ‘target’ for the edge.
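These source/target rules can be sketched as a single function – in Python rather than the R the project uses, with invented field names (`author`, `reply_to`, `retweet_of`, `quote_of`):

```python
import re

# One edge per relation a tweet carries: the sending account is the
# source, the referenced account the target, tagged by relation type.
def edges_from_tweet(tweet):
    src = tweet["author"]
    edges = []
    if tweet.get("reply_to"):
        edges.append((src, tweet["reply_to"], "reply"))
    if tweet.get("retweet_of"):
        edges.append((src, tweet["retweet_of"], "retweet"))
    if tweet.get("quote_of"):
        edges.append((src, tweet["quote_of"], "quote"))
    # Any @handle in the text counts as a mention.
    for handle in re.findall(r"@\w+", tweet.get("text", "")):
        edges.append((src, handle, "mention"))
    return edges

print(edges_from_tweet({"author": "@carol", "text": "so true @dave",
                        "reply_to": "@dave"}))
# [('@carol', '@dave', 'reply'), ('@carol', '@dave', 'mention')]
```

Note that one tweet can yield several edges – here a reply that also names the account it replies to produces both a reply edge and a mention edge.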
To make this work, a nodes list and an edges list are saved as CSV files and then opened in Gephi. An entry in an edge list looks like this:
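For reference, Gephi’s CSV importer recognises `Source`, `Target`, `Type` and `Weight` columns for edges, so a few rows of an edge list take this shape (the handles here are invented for illustration, not drawn from the data set):

```csv
Source,Target,Type,Weight
@alice,@bob,Directed,3
@carol,@bob,Directed,1
```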
Then Gephi’s graph algorithms arrange nodes and edges:
map of 18,000 tweets and 25,000 edges from a search based on the single word ‘socialism’, from a period of a few hours on Jan 5th 2019.
The most extensive network of tweets I’ve attempted to map thus far included a quarter of a million tweets from 79,000 accounts, with half a million connections of one of the four types (reply, retweet, quote, mention) between nodes. All of the tweets were made during a two-day period in January, and all referenced one person – a British Tory MP, Anna Soubry.
Mapping a network on this scale seems to be at the upper end of what either the graphing software or the laptop I’m using can handle – experimentation is needed to work out which. Another limitation of a data set of this size concerns its presentation: to show all 10,000 accounts with tweets collected in this search on socialism legibly on a screen would clearly require a screen of more than 10,000 pixels on a side. A zoomable vector image is a possibility, but that then requires a degree of interactivity, and its sense is not available on quick inspection.
The usual workaround in creating a representation of the entire data set is to hide nodes with fewer edges, leaving more prominent accounts visible. That is, in order even to appear in the graph, an account must refer to or be referred to by another account more than once. The number of edges a node has is its ‘degree’. I’ve used degree filters of 2, of 10, and of 25 in making representations of argument – it’s obviously a feature about which transparency is needed for better comprehension of the end result, so I usually post the degree filter used alongside the graph.
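A degree filter amounts to a simple count-and-threshold operation, sketched here in Python on a toy edge list (the filtering itself is done inside Gephi, not in code):

```python
from collections import Counter

# Toy undirected edge list; each edge touches two nodes.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "a")]

# A node's degree is the number of edges it participates in.
degree = Counter()
for src, tgt in edges:
    degree[src] += 1
    degree[tgt] += 1

# A degree filter of 2 keeps only nodes with at least two edges.
threshold = 2
visible = {node for node, d in degree.items() if d >= threshold}
print(sorted(visible))  # ['a', 'b', 'c']  ('d' has degree 1, so is hidden)
```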
Another approach to negotiating the scale of large data sets in representation is via alternate or sub-layouts – in this case essentially details of the larger map picking out parts of the network. A sub-layout for present purposes is any layout that uses the same tables of data as its parent but with parameters adjusted in the graphing software in order to present data differently.
US Senator Bernie Sanders and Liz Cheney had an exchange on Twitter that day about socialism, though without using the Twitter ‘reply’ function, after which many accounts tweeted at both of them at once. The small group on the right appears in proximity to Sanders and Cheney because one of them replied to Sanders; the rest of the cluster is these accounts communing in furious agreement with one another about the horrors of socialism.
This form of sub-layout doesn’t require the alteration of the graph per se, but others that do can give further insight into the data under analysis. The next image is essentially the same graph – based on the same data table, same degree filter (2), and without re-running the algorithm from scratch. Just one change has been made – here the node size and label size are determined by ‘out-degree’ rather than ‘in-degree’.
The difference between out- and in-degree is just that between being the source of tweets referencing other accounts and being the target of those tweets. Each edge a node has is one degree, but in this graph edges have a direction, out from a node or in to a node. A node, or account, with a high ‘in-degree’ count, as in the graphs above, is one that many other accounts are referencing. An account with a high ‘out-degree’ count, as the largest nodes below have, is one with a high intensity of tweets that both use the term searched and mention, reply to, retweet, etc. other accounts.
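The two directed counts fall straight out of the edge list, sketched here in Python on invented handles:

```python
from collections import Counter

# Toy directed edge list: each edge runs source -> target.
edges = [("@x", "@y"), ("@x", "@z"), ("@y", "@z"), ("@w", "@z")]

# Out-degree counts edges a node sends; in-degree counts edges it receives.
out_deg = Counter(src for src, _ in edges)
in_deg = Counter(tgt for _, tgt in edges)

print(out_deg["@x"], in_deg["@x"])  # 2 0  (@x references others, unreferenced)
print(out_deg["@z"], in_deg["@z"])  # 0 3  (@z is referenced by three accounts)
```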
There are various reasons why an account might rank highly in out-degree – one is someone using the account for concerted campaigning; another is extended conversation between agents using a group of accounts.
This, then, is the out-degree sub-layout for the agent-network-focused layout. Changing one feature radically changes the graph and its import: rather than the most prominent nodes featuring, measured by the attention of others, the most active nodes are laid bare. There are as many different sub-layouts of this type as there are ways of attributing different weights to the nodes based on the set of edges they and their neighbours are connected by.
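As an illustration of how the same edge list supports different node weightings, here is a Python sketch contrasting a raw edge count with a weight-summed count (edge weights invented for the example):

```python
from collections import Counter

# Toy weighted edge list: (source, target, weight), where weight might
# record how many times one account referenced another.
edges = [("a", "b", 3), ("a", "c", 1), ("b", "c", 2)]

# Weighting 1: out-degree as a plain count of edges sent.
out_degree = Counter(src for src, _, _ in edges)

# Weighting 2: out-degree summing edge weights, so repeated
# references count for more than one-off ones.
weighted_out = Counter()
for src, _, w in edges:
    weighted_out[src] += w

print(out_degree["a"], weighted_out["a"])  # 2 4
```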