Preprint, Working Paper. Year: 2019

Growth and Duplication of Public Source Code over Time: Provenance Tracking at Scale

Abstract

We study the evolution of the largest known corpus of publicly available source code, i.e., the Software Heritage archive (4B unique source code files, 1B commits capturing their development histories across 50M software projects). On this corpus, we quantify the growth rate of original, never-seen-before source code files and commits. We find the growth rates to be exponential over a period of more than 40 years. We then estimate the multiplication factor, i.e., how often the same artifacts (e.g., files or commits) appear in different contexts (e.g., commits or source code distribution places). We observe a combinatorial explosion in the multiplication of identical source code files across different commits. We discuss the implications of these findings for the problem of tracking the provenance of source code artifacts (e.g., where and when a given source code file or commit has been observed in the wild) for the entire body of publicly available source code. To that end, we benchmark different data models for capturing software provenance information at this scale and growth rate. We identify a viable solution that is deployable on commodity hardware and appears to be maintainable for the foreseeable future.
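To make the notion of a multiplication factor concrete, here is a minimal Python sketch. It assumes the factor can be read as the average number of distinct contexts (e.g., commits) in which each unique artifact (e.g., a file content) occurs; the exact definition used in the paper may differ. The function name `multiplication_factor`, the identifier strings, and the toy data are hypothetical and are not taken from the Software Heritage archive.

```python
from collections import defaultdict


def multiplication_factor(occurrences):
    """Average number of distinct contexts per unique artifact.

    `occurrences` is an iterable of (artifact_id, context_id) pairs,
    e.g. (file content hash, commit id). Repeated pairs count once.
    Assumption: this averaging is an illustrative reading of the
    abstract, not necessarily the paper's definition.
    """
    contexts_by_artifact = defaultdict(set)
    for artifact_id, context_id in occurrences:
        contexts_by_artifact[artifact_id].add(context_id)
    if not contexts_by_artifact:
        return 0.0
    total_contexts = sum(len(ctx) for ctx in contexts_by_artifact.values())
    return total_contexts / len(contexts_by_artifact)


# Made-up toy data: one file content shared by three commits,
# two other file contents seen in a single commit each.
toy = [
    ("blob:license", "commit:a"),
    ("blob:license", "commit:b"),
    ("blob:license", "commit:c"),
    ("blob:main", "commit:a"),
    ("blob:util", "commit:c"),
    ("blob:license", "commit:a"),  # repeated pair, counted once
]
factor = multiplication_factor(toy)
```

On the toy data, three unique file contents occur in five distinct (file, commit) pairs, so the factor is 5/3. Widely copied files such as licenses push this average up, which is the kind of duplication the abstract describes as a combinatorial explosion.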
Main file
swh-growth-tr.pdf (1.19 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-02158292, version 1 (17-06-2019)
hal-02158292, version 2 (19-06-2019)

Cite

Guillaume Rousseau, Roberto Di Cosmo, Stefano Zacchiroli. Growth and Duplication of Public Source Code over Time: Provenance Tracking at Scale. 2019. ⟨hal-02158292v2⟩
