Social Media and Prediction: Crime Sensing, Data Integration and Statistical Modelling

Funded by: ESRC under the NCRM Methods Innovation Call

Funding: £194,138

Contributors: Williams, Burnap, Sloan, Rana et al.

This project ran between 2013 and 2015, and aimed to test whether social media data could be used to estimate crime patterns at an aggregate level. Data were collected over 12 months, generating 180 million geocoded tweets and close to 600,000 Metropolitan Police recorded crime incidents. The ethics of using social media in crime and security research was a key component of the project.

The research developed new data fusion techniques and improved upon existing mathematical models that have used social media data to predict voting patterns, the spread of disease, the revenue of Hollywood movies, and the location of earthquake centres. The central hypothesis was that crime and disorder related tweets are associated with actual crime rates on the streets, and that these associations would outweigh conventional correlates of crime, such as unemployment and the proportion of young people in an area.

Findings include:

  • Each Twitter user is a potential sensor of offline phenomena, such as crime;
  • These sensors observe natural phenomena – the sights, sounds, and feel of the streets;
  • As in the case of the ‘broken windows’ thesis, these can include minor public incivilities – drinking in the street, graffiti, litter – that serve as signals of the unwillingness of residents to confront strangers, intervene in a crime, or call the police; cues that entice potential predators;
  • Sensors can publish information about local social and physical disorder in four ways: as victims; as first-hand witnesses; as second-hand observers (e.g. via media reports or the spread of rumour), and as perpetrators;
  • These social-actors-as-disorder-sensors have various characteristics. Some are activated (i.e. publish tweets) based on specific signs, while others are not (based on variation in perceptions of disorder). Data from these sensors also include temporal and spatial information. Sensors are not always switched ‘on’, as they may be offline, working, sleeping etc. They may also act in ways that make data difficult to interpret and validate (e.g. using sarcasm and spreading rumours). This means they produce data that are noisier than curated data. However, the number of sensors is prodigious: over 500 million tweets are broadcast daily from over 500 million accounts; 15+ million of these emanate from the UK;
  • The inclusion of Twitter data increases the amount of variance explained in the crime estimation models;
  • Tweets containing mentions of ‘broken windows’ indicators were positively correlated with criminal damage, theft from a motor vehicle, possession of drugs, and violence in low crime areas in London;
  • Tweets containing mentions of ‘broken windows’ indicators were negatively correlated with burglary in a dwelling, burglary in a business property and theft of a motor vehicle in high crime areas;
  • This pattern is in line with offline research which suggests discussions of neighbourhood degeneration and local crime issues at community meetings are not representative of local crime problems;
  • It is possible that residents in low crime areas are more sensitive to signs of neighbourhood degeneration and therefore feel motivated to broadcast instances of littering, graffiti and vandalism via social media, while residents in high crime areas are less motivated to express similar observations as they are not out of the ordinary (i.e. residents have become desensitised to neighbourhood decline);
  • Multiple sources of bias (propensity to use Twitter, propensity to tweet about crime issues, propensity to geolocate posts, etc.) are likely to be present and require suitable adjustments to be made before reliable estimates can be drawn using Twitter data in particular.
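
The finding that Twitter data increases the variance explained can be illustrated with a minimal sketch. It fits two nested ordinary least squares models on synthetic data and compares their R² values. Everything here is an assumption made for illustration: the variable names (unemployment, young_share, disorder_tweets, crime_rate), the coefficients, and the use of plain OLS are hypothetical and are not the project's data or models.

```python
import numpy as np


def r_squared(X, y):
    """In-sample R^2 of an OLS fit of y on X (an intercept is added)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


# Synthetic, illustrative data only: none of these numbers come from the project.
rng = np.random.default_rng(0)
n_areas = 200
unemployment = rng.normal(size=n_areas)     # hypothetical conventional correlate
young_share = rng.normal(size=n_areas)      # hypothetical conventional correlate
disorder_tweets = rng.normal(size=n_areas)  # hypothetical tweet-derived measure
crime_rate = (0.5 * unemployment + 0.3 * young_share
              + 0.4 * disorder_tweets + rng.normal(size=n_areas))

# Baseline model: conventional correlates only.
baseline = np.column_stack([unemployment, young_share])
# Extended model: conventional correlates plus the tweet-derived measure.
with_tweets = np.column_stack([unemployment, young_share, disorder_tweets])

r2_baseline = r_squared(baseline, crime_rate)
r2_with_tweets = r_squared(with_tweets, crime_rate)
print(f"R^2 without tweets: {r2_baseline:.3f}")
print(f"R^2 with tweets:    {r2_with_tweets:.3f}")
```

In-sample R² can never decrease when a predictor is added to an OLS model, so a comparison like this would in practice be paired with adjusted R² or out-of-sample checks, and with the bias adjustments noted in the final finding above.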

Key outputs of this research include ‘Crime sensing with big data: the affordances and limitations of using open source communications to estimate crime patterns’, British Journal of Criminology, and ‘Towards an ethical framework for publishing Twitter data in social research: taking into account users’ views, online context and algorithmic estimation’, Sociology.

This work is being extended to test if online hate speech improves the estimation of offline hate crime in London and LA County.