
Growth and Duplication of Public Source Code over Time: Provenance Tracking at Scale

Abstract : We study the evolution of the largest known corpus of publicly available source code, i.e., the Software Heritage archive (4B unique source code files, 1B commits capturing their development histories across 50M software projects). On this corpus we quantify the growth rate of original, never-seen-before source code files and commits. We find the growth rates to be exponential over a period of more than 40 years. We then estimate the multiplication factor, i.e., how often the same artifacts (e.g., files or commits) appear in different contexts (e.g., commits or source code distribution places). We observe a combinatorial explosion in the multiplication of identical source code files across different commits. We discuss the implications of these findings for the problem of tracking the provenance of source code artifacts (e.g., where and when a given source code file or commit has been observed in the wild) for the entire body of publicly available source code. To that end, we benchmark different data models for capturing software provenance information at this scale and growth rate. We identify a viable solution that is deployable on commodity hardware and appears to be maintainable for the foreseeable future.
Document type :
Preprints, Working Papers, ...
Contributor : Stefano Zacchiroli
Submitted on : Wednesday, June 19, 2019 - 10:41:00 AM
Last modification on : Friday, February 4, 2022 - 3:31:54 AM


Files produced by the author(s)


  • HAL Id : hal-02158292, version 2
  • ARXIV : 1906.08076


Guillaume Rousseau, Roberto Di Cosmo, Stefano Zacchiroli. Growth and Duplication of Public Source Code over Time: Provenance Tracking at Scale. 2019. ⟨hal-02158292v2⟩


