Google: Meta-communities of practice in the code-sharing commons

Lead Research Organisation: Lancaster University
Department Name: Sociology

Abstract

The extraordinary proliferation of software projects in the last two decades defies easy analysis. Software saturates everyday life (Dodge 2011). This research project uses data generated as people work on online code repositories and use programming Q&A sites to analyse the diversity and homogeneities of programming practice. While much recent debate around software, and the digital economy more generally, has attended to issues of intellectual property associated with open source or free software, this project aims to construct a new evidence-base for debate on these and other issues. It treats programming as a set of practices amenable to social scientific analysis both in terms of common patterns (patterns of coordination, development of group structures) and in terms of the migration and recombination of these practices across different settings. The aim of the research is to bring to light the diversity and sameness of practices within what we call the meta-community of programming practices. The term 'meta-community' refers to the dispersed yet linked communities of practice that populate contemporary software cultures.
Our research will draw on two major kinds of publicly accessible data produced when people make or code software together. The first kind of data comes from code repositories. Online code repositories support the collaborative production of software. A second main kind of site addressed in this project complements the first: the programmer question and answer site StackOverflow. This site collaboratively filters a huge range of very specific questions about how to write code, and how to use existing software libraries and other pieces of code. It is widely regarded as the primary site of online help for programming and software-development problems.
The major repositories such as Github, SourceForge, Launchpad, bitbucket and Google Code host millions of projects. These projects display very diverse shapes and forms, and an extremely wide range of topics, interests, investments, commitments and uses. Online code repositories supply extensive and fine-grained data on the projects, community membership and a range of actions associated with coding. They also publish the source code for the software itself. While this data is published largely in order to facilitate project management and coordination, we see the repositories as a source of rich data on how software is made in practice, and as exemplary sites to investigate different practices associated with the digital economy.
Our research will involve writing scripts to gather data from these sites, and then organising this data in a large relational database. We will focus on data relating to project topics, data stemming from certain kinds of actions (such as 'commit-actions'), data describing overlaps or dependencies between pieces of code ('shared dependencies'), and data derived from queries about how to write code ('practice intensity'). In order to integrate and find patterns in these datasets, we will draw on well-documented techniques of machine learning. Although these have been relatively little-used in the social sciences, we see them as an important avenue for exploring heterogeneous datasets.
We plan to disseminate the research in novel form. We will bring together a snapshot of the database, the scripts used to generate, analyse and visualize the data, and our written analysis in a virtual machine that can be downloaded and run on most computers. Virtual machines (platform-independent software environments) have recently appeared in the genomic sciences as a way of addressing problems of reproducibility in data-intensive research (Encode Consortium 2012). By distributing data, code, visualization and analysis in a coherent, independent package, perhaps for the first time in contemporary social science practice, we will facilitate reproduction of the analysis and exploration of data by interested users inside and outside academia.

Planned Impact

The project focuses on ways of tracking practices through datasets generated during programming practices such as writing code. Given the timespan and close focus on methodological development, immediate beneficiaries of this research are located within academic or near-academic settings in the social sciences.
We have identified several non-academic beneficiaries. They include practitioners of software development in various settings ranging across industry, government, commerce, media, finance, and innovation settings. The other main beneficiaries include industry advisory groups such as the BCS Open Source Software Specialist Group, and the UK-based Open Source Consortium. Software and IT-industry analysts might also benefit from our research. Finally, computer science and software engineering disciplines are academic beneficiaries through which the research can have indirect economic and social impacts.
We regard software developers of all varieties as important beneficiaries. The research will have much significance for programming communities. Analyses of the popularity of programming languages attract much attention in online programmer forums and news sites (for example, the extensive coverage of the TIOBE programming language index (TIOBE Software 2012) on Slashdot.org, a news site widely read by programmers and software developers). This research will offer a richer description of programming practices over time. We plan to present intermediate results of the analysis at industry software conferences such as the annual PyCon conference. As various results, visualizations and summaries of this project are published on the project research blog, the study should attract sustained social media attention and interaction with us via blog comments and Twitter conversations. Our practices of dissemination (virtual machine via FTP and USB sticks, presentations at user groups and industry conferences, online documentation in blog, use of code repository for our code) are, moreover, designed to engage and to sustain interactions with these beneficiaries.
A range of industry and policy advisory groups are public beneficiaries. Groups who promote open source software development, or provide advice on software-related activities to government and industry, are potential users of this research. In the software industry, as well as in various media and scientific settings, software development is a core innovation process. In its analysis of the diversity of programming practices, as reflected in code repositories and programming Q&A sites, the research will offer a uniquely comprehensive and comparative analysis of the importance and relevance of specific software practices (e.g. programming languages, computing platforms, application domains, etc.). It will also furnish an evidence-base for the development of policy advice in many areas affected by or dependent on software development. While we do not see ourselves as delivering the kind of analysis supplied by industry analysts (e.g. Gartner Consultants or RedMonk), there are potential overlaps between their interests and the knowledge produced here.
The research is in interdisciplinary dialogue with computer science and software engineering. These disciplines can have a more immediate impact on software development practices, partly through their training-educational activities, and partly through the pursuit of their own research priorities. In relation to users in academic disciplines interested in the domain of software development (software engineering), we think that our research could potentially open up different training priorities, and different perspectives on what they teach. The research might also identify interesting research needs and problems that could be addressed by the development of software tools, or other technical innovation. These kinds of impacts are most likely to occur outside the time-frame of the funded research.

Publications

Fuller, M. Goffey, A. Mackenzie, A. Mills, R. Sharples, S. (2016) Archives in Motion

Mackenzie, A. (2015) Data

Mackenzie, A. (2016) Code and the City

 
Description The focus of this project has been on how software is made today. We have developed, perhaps for the first time, some ways of exploring the sheer volume and variety of software devices appearing in contemporary social fields. The key achievements of this work are:
1. By using data analytic techniques, we have begun to develop ways of mapping the relations between, and relevance of, the millions of software projects found on the code repository platform http://github.com. We have identified significant links between patterns of coding practice, forms of social organisation, and the underlying infrastructures and platforms that shape the development of code. We have also begun to show how these links generate flows of imitation, copying and at times contagion around and through software.
2. The research addressed important methodological problems in working with large volumes of data provided by contemporary social media platforms. Using a combination of qualitative and quantitative methods (including machine learning classification techniques), a selection of cutting edge analytic tools (Google BigQuery, and Google Compute) and working with live datastreams using programming languages (R and Python) to query, transform, aggregate and visualize event-form data, we developed ways of delving more deeply into the sometimes noisy and complex flows of data and events associated with large-scale media and collaborative platforms. We experimented with cutting edge data infrastructures ranging from cloud computing services to machine learning and data-mining techniques in working with our data. Importantly, we addressed some key problems of how to circumvent the limitations or packaging of data produced by the platforms.
3. We developed novel social theoretical constructs -- such as the concept of the 'field of devices' and the notion of 'metacommunity' -- to help understand the dynamics and processes we see at play around software, media, culture and digital economies. We believe that these constructs will be more widely useful. They draw on existing sociologies of culture, technology, cities and crowds, and combine them with other recent social science accounts of device-specific research. Our work also addresses the need for viable and effective sociological alternatives to the somewhat flattened accounts of social life advanced by social physics and some computational social science.
4. Finally, practically emulating the subjects of our research, we were able to make use of the same practices and infrastructures as the people we were researching -- software developers and other users of code repositories -- to not only present the results of our work, but to document and distribute almost every stage of research work. We have developed a way of versioning the research project, including data analysis, scripts, documents as a 'release' that can be cited (using DOI identifiers) and that will persist through time.
Exploitation Route We are not sure of this yet, but early publications from the project and invitations to develop both the social theoretical insights and the data analytics approaches have started to appear. We think that the notion of 'field of devices' and some of the techniques we developed to explore the diversity of code as a metacommunity will be useful to analysis of other social, cultural and economic domains.
We anticipate that our approach to versioned releases of packages of data and analysis offers a model that could be used more widely in social sciences.
Sectors Creative Economy; Digital/Communication/Information Technologies (including Software); Education; Government, Democracy and Justice; Culture, Heritage, Museums and Collections

URL http://metacommunities.github.io/metacommunities/
 
Description This one-year project was focused on exploring and demonstrating the viability of data analytic methodologies for tracking groups, organisations (civic, governmental and commercial) and devices present in large online code repositories such as Github.com. The project sought to explicitly engage with groups, organisations and devices (platforms, programming languages, software tools, data infrastructures) at a highly aggregate level (on the order of tens of millions of code repositories, and five million programmers and developers). The social and economic impact of this endeavour has not yet connected directly to these organisations, groups and individuals. Although the project's pathways to impact included software developer groups and communities, we did not manage to engage with them directly within the time frame funded by the grant. We did, however, engage with other related organisations during the course of the research in several ways. The main beneficiary of this work was, somewhat unexpectedly, Google Corporation. We made such extensive use of one data analytic tool -- BigQuery -- that the Google Cloud Platform Development team based in California contacted us directly to discuss our use of it. This led to a 90-minute Skype interview with Stefano Menti, a team manager at Google. This discussion, which Menti recorded, was largely focused on the ways in which cloud computing services (databases and compute capacity) could become more usable. At Menti's request, we also wrote a blog post for the Google Developer's blog describing our experience, but this blog post was not published. Although unlikely to be directly attributable to our conversation, soon afterwards, Google greatly reduced the cost of using BigQuery. At the end of the project, we wrote a discussion paper for circulation through Google and other interested parties.
This paper described our experience and some of the difficulties in working with large online datasets using cloud-based analytics tools. The paper was discussed at a workshop held at Google London, January 2014, with Google UK employees and other academic participants. The discussion paper is available at the 'metacommunities' Github repository.
First Year Of Impact 2014
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Economic

 
Title metacommunities github repository 
Description This repository contains secondary data, intermediate results, data analysis scripts (R and Python), database queries (for Google BigQuery-based dataset) and draft manuscripts based on this work. 
Type Of Material Database/Collection of data 
Provided To Others? No  
Impact This is newly released (Nov 2014). The DOI for citation purposes is shown below 
URL http://dx.doi.org/10.5281/zenodo.12651
 
Description BSA Presidential Event: The Challenge of Big Data, Sept 2013, British Library 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Judging both by discussion at the talk and later blogging and online mentions, this event has informed discussions amongst social scientists about big data and research methods

None observed yet
Year(s) Of Engagement Activity 2013
 
Description "Many names and large numbers: methods for imitative fluxes?" Methods Mixtures Series, Centre for Science Studies, Lancaster 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact This talk will be part of a series of talks on methods in sociology of science and technology.

This is yet to occur
Year(s) Of Engagement Activity 2015
URL http://www.lancaster.ac.uk/fass/centres/css/event/5373/
 
Description Public Lecture '192 million events and counting', IT University of Copenhagen, 22 April 2014
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Talk led to discussion with Copenhagen postgraduates from three universities about the development of new digital methods for research in contemporary technological cultures.

Several postgraduates subsequently attended a training workshop I ran
Year(s) Of Engagement Activity 2014
 
Description 'Digital sociology' postgraduate student training provision 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact We made available the code and preliminary analysis of data for the project from our own code and data repository to MA postgraduate students on the 'MA Digital Sociology' at Goldsmiths.

The course convenor writes: 'The materials were used as part of seminar series for postgraduate students studying MA Digital Sociology as a case study demonstrating current best practice in Software Studies analysis'
Year(s) Of Engagement Activity 2014
 
Description Code Acts in Education ESRC-funded workshop Stirling, 12 Sept 2014 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact This was an invited presentation to education researchers as part of an ESRC-funded seminar series on software and coding. The discussion suggested that the work was regarded as stimulating new perspectives on how coding is being done.

There were no direct impacts apart from the discussion at the workshop
Year(s) Of Engagement Activity 2014
URL http://codeactsineducation.wordpress.com/seminars/
 
Description Data visualization workshop, ITU Copenhagen 20 June 2014 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Undergraduate students
Results and Impact This 1 day workshop attracted approximately 25 participants from Denmark and southern Sweden. It included PhD students and some staff from 5 different universities. The students and staff were from varied disciplinary backgrounds including anthropology, sociology and information systems.

Online news of this activity led to invitations to repeat the workshop here in the UK (e.g. University of Edinburgh)
Year(s) Of Engagement Activity 2014
URL http://www.dasts.dk/?p=2658
 
Description Google Workshop, London, 14 January 2015 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact I presented an account of key elements and findings of the research project to an audience comprising Google staff, ranging from policy advisors to engineers. Academics and ESRC staff were also present. A written report had been circulated in advance of the meeting. After the talk, discussion and questions from Google participants mainly focused on how Google data analytic tools and services had figured in the project. Questions from academics concerned the methods and findings of the project.

Along with other discussions with Google, the main impact would probably be some awareness of Google staff about how their data analytic tools might become more widely used in research work.
Year(s) Of Engagement Activity 2015
 
Description Keynote Address to Denmark Science and Technology Studies Conference 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Type Of Presentation keynote/invited speaker
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact This opening keynote presentation entitled 'Device specific research and massive streaming events' was described by one participant as 'pioneering.'

The main impact of this presentation was further invitations to present work elsewhere in Denmark (at Copenhagen Business School and at Aarhus University).
Year(s) Of Engagement Activity 2014
URL http://www.dasts.dk/wp-content/uploads/2014/05/DASTS-program-2014.pdf
 
Description Machines of the code-sharing commons, a mid-way report on a slightly large scale analysis of software repositories, 12 March 2014 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The talk was part of an ongoing series on 'data practices' held in 2013-2014 at Goldsmiths, London and has led to ongoing discussion with various academic peers.

This session was meant to create awareness of the project amongst London-based UK researchers. Later discussions and invitations to present work in other forums in the UK suggested that this worked.
Year(s) Of Engagement Activity 2014
URL http://www.gold.ac.uk/csisp/events/csispevents2013-2014/#d.en.64734
 
Description Metacommunities of Code Blog 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Participants in your research and patient groups
Results and Impact We blogged about interim results of the project and opened the blog to feedback and comments from software developers

We did receive feedback, partly in response to our own questions about how to make sense of the data we were looking at.
Year(s) Of Engagement Activity 2014
URL http://metacommunitiesofcode.org/
 
Description The Allure of Big Data LSE Social Study of ICTs Workshop, 25 April 2014 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact This was an intervention designed to provoke discussion about the scope and potentials of data analytic techniques in the social sciences and humanities. It did create some controversy because it was somewhat critical of the promises of the techniques.

Nothing of this nature arose.
Year(s) Of Engagement Activity 2014
URL http://www.lse.ac.uk/management/events/conferences/140425-SSIT-14.aspx
 
Description Workshop paper at Code and City Workshop, Maynooth, 3-4 Sept 2014 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Type Of Presentation keynote/invited speaker
Geographic Reach International
Primary Audience Other audiences
Results and Impact This was a pre-circulated paper delivered to an invited audience. The paper was discussed quite exhaustively as the workshop papers are scheduled for publication in an edited volume in 2015.

The workshop was part of an EU-funded initiative on 'The Programmable City' and the paper has led to further invitations to contribute work to the initiative.
Year(s) Of Engagement Activity 2014
URL http://www.maynoothuniversity.ie/progcity/?p=685