Over the past 3 months, we have largely completed the rollout of Git/GVFS to the Windows team at Microsoft.
As a refresher, the Windows code base is approximately 3.5M files and, when checked in to a Git repo, results in a repo of about 300GB. Further, the Windows team is about 4,000 engineers and the engineering system produces 1,760 daily “lab builds” across 440 branches in addition to thousands of pull request validation builds. All 3 of the dimensions (file count, repo size and activity), independently, provide daunting scaling challenges and taken together they make it unbelievably challenging to create a great experience. Before the move to Git, in Source Depot, it was spread across 40+ depots and we had a tool to manage operations that spanned them.
As of my writing 3 months ago, we had all the code in one Git repo, a few hundred engineers using it and a small fraction (<10%) of the daily build load. Since then, we have rolled out in waves across the engineering team.
K.I.S.S. … (keep it simple stupid!)
this is one beast that needs a serious diet…
Agreed, I feel lucky to be a Linux user
However, in general the git fs is a good approach. In git you have the repo database and a working-dir checkout. With a git fs you save the space for the working-dir checkout. Furthermore, you could mount multiple branches or revisions without additional cost. So definitely a nice thing to have…
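To make that concrete, here is a minimal sketch of the idea (using plain git rather than GVFS, and with a hypothetical repo path, branches and file name): file contents can be served straight from the repo database, so several revisions can be read without any working-dir checkout at all.

    import subprocess

    def read_at_revision(repo_dir, revision, path):
        """Return the contents of `path` as stored at `revision`,
        served straight from the object database (no checkout needed)."""
        return subprocess.run(
            ["git", "-C", repo_dir, "show", f"{revision}:{path}"],
            check=True, capture_output=True,
        ).stdout

    # Hypothetical repo, branches and file name -- adjust to taste.
    repo = "/src/some-repo"
    main_copy = read_at_revision(repo, "main", "kernel/init.c")
    old_copy = read_at_revision(repo, "release/1.0", "kernel/init.c")

A virtual filesystem like GVFS essentially does this kind of lookup on demand behind normal file reads, which is where the space saving comes from.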
Linux would also be a huge project if it already included a full desktop system instead of just the kernel and cared about compatibility with existing compiled software.
AFAIK that is the total size of the codebase stretching back 20+ years. Also remember that Windows includes much more than e.g. Linux (which is mostly the kernel) and that this includes a lot of variants, e.g. for Xbox etc.
But I agree that with so large a project (projects?) there are more chances for bugs to hide.
Should the tree be pruned of no-longer-used code once in a while to keep it manageable?
IMHO no, having a comprehensive development history is worth a lot.
…no Visual SourceSafe?
MS needs to dump 95% of the developers and rewrite 90% of the code.
Can you also specify which 5% of developers can stay and which 10% of the code is perfect?
Of course not, you just came to troll with a completely useless comment
(good luck rewriting 90% of 300 GB with 5% of the developers!)
The code should never have reached 30GB – let alone 300GB. Code should have been constantly refined instead of adding layer upon layer of crap produced by underpaid outsourced code monkeys.
Carmakers routinely replace every major component in a model over a 20-year period. Nothing gets added until it is thoroughly tested, and anything defective gets improved or replaced. A carmaker will spend millions of dollars to save a couple of kilograms of weight or improve fuel efficiency by 1%.
And how often do carmakers completely rewrite their software?
—————
And how has the churn associated with ALSA -> PulseAudio, X11 -> Wayland, SysV Init -> systemd etc gone over with Linux developers and admins?
When someone detects they are cheating 😉
Did you have access to the Windows source code in order to strongly assert it is a lot of crap?
Did you receive some complaints about some Windows coder being underpaid?
IMHO, they did a very good job implementing all its backwards compatibility. They could not be so irresponsible as to leave legacy code broken and not running just because some jerks think the newer stuff is always better. They did an amazing job keeping their basic APIs (Win32) and ABIs completely stable.
Raymond Chen explains a lot of decisions taken into the OS to support backwards compatibility in his blog:
https://blogs.msdn.microsoft.com/oldnewthing/
ebasconp,
I used to say this as well, but in the past decade backwards compatibility has become worse. I’ve seen a lot more breakages even in corporate environments where critical business apps stop working. Backwards compatibility is no longer something windows users can brag about, IMHO. Or if they do, it’s less true than it used to be.
The NT series hardware requirements have increased by almost three orders of magnitude in 30 years. That is a sure sign of massive bloat and crappy code.
It is the code including history!!
Just to get an idea: our RTOS repo already needs 1.1 GiB (unpacked), whereas the current branch has only 170 MiB. And that’s with only 15 years of history!
So 300GB is not really much.
Just as I said, you were just trolling and cannot specify which 5% of developers can stay.
“The code” is clearly not 300 GB of active code. Compiling 300 GB of active code into the 4 GB that is a current Windows install.wim would be an amazing feat! That 300 GB is all the history and also includes things like test files, which can be far larger than the actual code. To give you an idea, I probably will not write more than 1 MB of actual code in my whole life, just like most writers will not write more than 1 MB of actual book in their whole life… but the PDF with our company logo is probably close to that 1 MB already.
Your car comparison is just silly and even false. Just like we do different things with computers compared to 30 years ago (4K full-screen video on the internet vs postage-stamp local videos, and “HTML/CSS/JavaScript vs RTF”), the same is unfortunately true for cars, so their weights have gone up and fuel efficiency has gone down: http://www.nytimes.com/2004/05/05/business/average-us-car-is-tippin… And that wasn’t so much related to technical features as it was to oil/gas prices and comfort!
I’d say 1 MB isn’t that much. For example, my last C++ project, representing less than a year of coding time, is at 701k. This is just source code, with no media and excluding 3rd-party libraries.
Well, dumping an arbitrary number of developers and rewriting an arbitrary amount of code CERTAINLY won’t introduce any new bugs.
You must be an MBA.
How the fuck can a project blow out so much that it literally requires 1000x the hardware resources, probably 100x the developers and still be arguably less stable than it was 30 years ago?
In any other engineering discipline you would eventually realise you have a disaster, cancel the project and start with a clean sheet.
https://en.wikipedia.org/wiki/Windows_NT#Hardware_requirements
As you can see, you can blame a lot on Vista, but there is no “1000x increase” over the last 30 years, and in the last 10 years there has basically been no increase in hardware requirements at all even though hardware DID actually get faster. Conclusion: no absolute bloat, less relative bloat, happier users!
Also, comparing NT 3.1 to “10” is pretty much like comparing an arrow to a machine gun. All that extra code and resource usage didn’t just go to waste, and a “10” machine now is a whole lot cheaper than an NT 3.1 machine used to be.
The Google codebase includes approximately one billion files and has a history of approximately 35 million commits spanning Google’s entire 18-year existence.
The repository contains 86TB of data, including approximately two billion lines of code in nine million unique source files.
https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billi…
Not in one repository.
The linked article says otherwise. It’s even in the URL.
That’s not using Git though.
Why do you put it in one repo?
That’s not how you should use git. 🙁
There are arguments for having all of your software components in one repo.
Facebook does it: https://goo.gl/yMT9ZY
I do not necessarily agree with this approach but it’s actually no more right or wrong than using multiple repos.
It’s not the smart thing to do; it seems even more silly because git allows for submodules and they clearly already had things split up.
https://www.youtube.com/watch?v=4XpnKHJAok8&t=43m11s
Lennie,
That’s a very insightful link. I hadn’t seen it before, but in summary Torvalds thinks putting everything in one repository is “not very smart”.
I’d point out that it’s not actually an unreasonable thing to do in principle, but that git itself doesn’t scale the same way as other systems do. Torvalds sort of acknowledges this as well.
Not needing such a tool anymore would probably be a good enough reason for this change
This is basically a very long revision history of the equivalent of: “Linux kernel + Xorg + Gnome/KDE + systemd + pulseaudio + openssl + libpng + webkit + gtk/Qt, and many others”.
300GB isn’t too far from what you may expect from the sum of many projects under one company’s development infrastructure. If they are keeping all the history since Windows 2000, it is likely around 120GB in compressed revision history and 180GB in fully pulled files.
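For what it’s worth, anyone curious can get the same history-vs-checkout split for a repo they already have locally with a rough sketch like this (my own approximation, and the repo path is just a placeholder): it compares git’s compressed object store against the size of the fully checked-out working tree.

    import os
    import subprocess

    def packed_history_size(repo_dir):
        """Approximate size of git's compressed object store (the history), in bytes."""
        out = subprocess.run(
            ["git", "-C", repo_dir, "count-objects", "-v"],
            check=True, capture_output=True, text=True,
        ).stdout
        kib = {k: int(v) for k, v in (line.split(": ") for line in out.splitlines())}
        return (kib.get("size", 0) + kib.get("size-pack", 0)) * 1024  # values are reported in KiB

    def working_tree_size(repo_dir):
        """Size of the fully pulled files on disk, in bytes (skips the .git directory)."""
        total = 0
        for root, dirs, files in os.walk(repo_dir):
            dirs[:] = [d for d in dirs if d != ".git"]
            total += sum(os.path.getsize(os.path.join(root, f)) for f in files)
        return total

    repo = "/path/to/some/repo"  # placeholder
    print("history :", packed_history_size(repo) // 2**20, "MiB")
    print("checkout:", working_tree_size(repo) // 2**20, "MiB")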
Basically 90% of Windows development is done by Gits?
I already knew that.