# thinking-together
a
The Googler book on how different engineering is with tens of thousands of developers is apparently out! https://twitter.com/gergelyorosz/status/1253051516228952067
(I work at Google but I don’t have anything to do with the book.)
i
Is Google's code still all stored in a giant monorepo?
a
Almost all of it. The open source projects sometimes find other arrangements to better support outside collaboration.
r
I came here to ask the same monorepo question. The question I'm always so curious about with monorepos is what you do about your CI? In my personal repos, one of my biggest annoyances is waiting for CI to run, and the main way I keep the CI time as short as possible is modularizing my code into different repos.
I'm guessing they probably use heuristics to determine which parts have to be rebuilt? But I'm surprised there's not more discussion of how that technique works. In my opinion no one should be using a monorepo without a solution to that problem.
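I imagine the core of it looks something like this — a toy Python sketch of my guess, with made-up targets and edges, where CI walks reverse dependencies outward from whatever changed:

```python
# Given a dependency graph (target -> targets it depends on), find every
# target a change could affect by walking the *reverse* edges.
# CI then rebuilds and retests only those.
from collections import defaultdict, deque

deps = {
    "app": {"ui", "core"},
    "ui": {"core"},
    "core": set(),
    "tools": {"core"},
}

# Invert the graph: target -> targets that depend on it.
rdeps = defaultdict(set)
for target, ds in deps.items():
    for d in ds:
        rdeps[d].add(target)

def affected(changed):
    """Transitive closure of `changed` over reverse dependencies."""
    seen = set(changed)
    queue = deque(changed)
    while queue:
        t = queue.popleft()
        for dependent in rdeps[t]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen

print(affected({"core"}))  # {'core', 'ui', 'app', 'tools'} (some order)
print(affected({"ui"}))    # {'ui', 'app'} -> nothing else rebuilds
```

Presumably real build systems layer content hashing and caching on top, but some graph walk like this would be the heart of it.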
a
I’m not too eager to go on the Permanent Record about corp internals, so I will just point at the book again and hope it answers that question. 😅
🍰 5
r
Fair 🙂
e
There is no better evidence for the "system is broken" assertion than the gigantosaur that is the Google central code base, which is so large they don't even want to admit how big it is, because it shows that massive code duplication is going on. It was 2 billion lines in 2015, and is probably over 10 billion lines now. Clearly it is easier for the engineers to reinvent the wheel than to search through the giant pile of stuff they have. We have yet to enter the era of interchangeable parts, and that will be a revolution.
👍 2
a
I think it’d be more accurate to say it’s easier to reinvent and/or copy a wheel than to repurpose one.
k
@robenkleene your CI question should be answered by https://bazel.build. The unit of integration doesn't really have to be a repo.
👀 1
e
@alltom you bet, I would wager there is tons of copy/paste going on inside that monstrous pile of code. Contrast that with a set of well-tested, documented components that each do one small thing well. Lego vs. the Library of Congress.
Bazel looks nicer than Make, but since it doesn't compute the dependencies, you have to set them up yourself, and making the makefile for C was always a pain. One of the weaknesses of many older languages is that you can't compute the dependencies quickly and easily. Contrast that with Modula-2, which did not allow dynamic dependencies, and forced all IMPORT statements into the first lines of the program, so you could scan them instantly to find out what other modules are needed. I am trying like heck to not need a tool like Bazel in my language. Especially with possible remote installation steps to consider, it is a very tricky problem indeed.
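To show how trivial that makes dependency extraction, here is a toy scanner in Python — the Modula-2 syntax handling is simplified and the sample module is invented, but it shows why "imports first" means you can stop reading at the first real statement:

```python
# Toy dependency scanner in the Modula-2 spirit: imports must appear at
# the top of the file, so we stop at the first non-import statement and
# still know every dependency.
import re

IMPORT_RE = re.compile(r"^\s*(?:FROM\s+(\w+)\s+)?IMPORT\s+([\w,\s]+);")

def scan_imports(source: str) -> set[str]:
    deps = set()
    for line in source.splitlines():
        line = line.strip()
        if not line or line.startswith("(*"):  # skip blanks and comments
            continue
        if line.startswith("MODULE"):          # skip the module header
            continue
        m = IMPORT_RE.match(line)
        if m is None:
            break  # first real statement: no more imports can follow
        if m.group(1):                          # FROM Foo IMPORT x, y;
            deps.add(m.group(1))
        else:                                   # IMPORT Foo, Bar;
            deps.update(name.strip() for name in m.group(2).split(","))
    return deps

src = """MODULE Chess;
FROM Board IMPORT Setup, Move;
IMPORT Clock, Log;
VAR turn: INTEGER;
"""
print(scan_imports(src))  # {'Board', 'Clock', 'Log'} (some order)
```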
👍 2
j
Certainly more duplication of effort outside Google in the tens/hundreds of thousands of software companies who write their own glue code than inside Google. (The same is true of any big company, mind you).
💯 1
a
Yeah, I don’t think Edward was saying Google’s repo was any more of a cesspool than any other (in this instance? :), but that even in a situation that theoretically maximizes the capacity for reuse, there’s a crazy amount of code, which I would agree with. Hickey had that whole talk about decoupling the various semantics of “map” just to the extent that it could run in serial or in parallel, and it’s just plain impossible for every engineer to put that amount of thought into every bit of code they write. I feel that most of the lack of reuse / repo bloat is just that: someone writes a module that complects the type of input, type of output, programming language, runtime, threading model, framework, etc, and separating any of those is just so hard that it gets rewritten instead, but with all the same rigidness in a new configuration, because it’s still hard to write in a factored way.
👍 2
☝️ 2
e
It is the tangle of dependencies that one encounters that is the main impediment to code re-use. You try to use just one routine, but it calls 3 others, which call 3 others, etc., leading to an above-linear increase in the amount of included code accompanying the small piece you really wanted. Some languages are easier to "tree shake" than others. I once converted a giant C program to Modula-2, and the line count dropped in half, even though the languages are 1:1 identical in almost every statement. The trick was that in Modula-2 you had to name every import, and it took work to do that; to avoid that hassle, you would unconsciously write code that minimized dependencies, and the net result was a higher level of sharing.

I would bet that with their high salaries and consequent high quality of workers, Google is probably better than average companies at encouraging code re-use, but Google cannot escape the fact that they are using C++, JS, Java, etc., all languages designed without much care about code re-use, so the geometric increases are going to happen whether they like it or not, simply because of the languages they are using.

This same problem of "copy-pasta", as I believe it is called, happened at IBM many years ago. IBM actually encouraged duplication (!) because they felt that otherwise you would be breaking other people's projects with your changes, so they thought non-sharing was safer. Of course it leads to low productivity and the non-fixing of bugs across semi-shared code. Then you have the opposite approach in big open source projects, where the damn thing is broken constantly, and you have to cherry-pick the right release day to get something reasonably stable. No question in my mind that a world of interchangeable software parts is possible, and will represent a seismic shift in the industry.
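That above-linear growth is easy to see with a toy calculation (the fan-out numbers are made up, just to show the shape):

```python
# If each routine you pull in calls `fanout` new routines, `depth`
# levels deep, linking one routine drags in a geometric series of code:
def dragged_in(fanout: int, depth: int) -> int:
    return sum(fanout ** level for level in range(depth + 1))

print(dragged_in(3, 4))  # 1 + 3 + 9 + 27 + 81 = 121 routines for one call
```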
150 lines of code added per person per day, times 200 days a year, times 30K developers, equals 900 million lines per year added to the repository. If it was 2 billion lines in 2015, let's guess that it is about 4-5 billion lines now, hoping for replacements rather than pure additions.
k
Fortunately 150 LoC/day is an order of magnitude too high for a large company. By the time you've checked that things work for 100M users, and gone through 30k tests, and put up with 16 rounds of bikeshedding and miscellaneous language lawyers, you're lucky to average 10 lines a day.
😆 5
a
Excuse my ignorance of Modula-2, but when you say that it encourages you to minimize dependencies, and that you solve that in a way besides copy-pasting, does that mean you have just one dependency on a "kitchen-sink" module?
e
I built two of the biggest Modula-2 projects ever made in the language, and used it exclusively for 20 years. Modula-2 was the sequel to Pascal, ten years later. The big improvement, other than a few expansions of the type system so you could have POINTER TO ARRAY [-5..+5] OF A_RECORD, which beats the hell out of C's nutty notation, was the module system.

Modula-2 had a unique approach that Prof. Wirth actually discarded as cumbersome in Oberon. In Modula-2 each module had 2 separate files: an implementation file (the big one with the actual code) and a definition module, which held publicly visible constants, types, and function declarations. You compiled each of these 2 files separately, the definition first of course. What this gave you is that once you pinned down the external aspects of a module, you were free to change the implementation part, and any other module that depended on it did not have to be recompiled when you changed only the implementation file; just relink the program and execute. This goes way beyond precompiled headers in C, because most of the time you change something in the implementation module and don't modify the number or type of function parameters. This means recompilation of a 100k-line program takes seconds. For small projects it is like having a REPL. There is a freeware Windows compiler (formerly the Stony Brook compiler) on the ADW website. This was the compiler I used for the Windows side, and there is a small German firm, P1, that made the Macintosh compiler.

Anyway, by forcing you to constantly evaluate which symbols are public or private, and being able to check that every call has the correct precise type, a great many errors are caught at compile time. Modula-2 was targeting systems programming; it had no graphical primitives, so I used the Win32 API and, on the Macintosh, the QuickDraw system (for System 7). The way the language is structured, it drives you subtly towards a very modular type of coding style. You end up with various function libraries that tackle different tasks, and as your program gets larger you tend to make it even more modular and systematic. So it is a language that encourages virtue, and a slightly lower exponent of expansion than C, for example. When runtime checks are fully enabled in Modula-2 it puffs up the code by 30%, but that means you have tens of thousands of range checks, assignment-compatibility checks, null-pointer checks, etc. that are very helpful during testing. For the golden master you turn them off and your program gets a big speed boost.

Although I only got to speak with Prof. Wirth once, I consider myself a disciple of his Swiss school of programming, which is all about neatness, economy, and rigor. Modula-2 as a language was damaged severely when Prof. Wirth made a sequel called Oberon; in Oberon 1 he stripped out some very valuable features of Modula-2 and thus made it impossible for the Modula-2 users to move forward. This was not corrected until Oberon 2, and by that time the world had discovered the new panacea of OOP (a disaster IMHO), and Java ran away with the academic market. Modula-2 did not have a good free compiler. Interestingly enough, Logitech, the famous mouse maker, was founded by a Swiss person, and they offered one of the first symbolic debuggers, which had the amazing feature that when your program crashed, it would save the entire state of memory and registers, and you could then browse the moment of the crash with full symbols.
This post-mortem dump, as it was called, was a fantastic step forward over the memory dumps and very crude crash-reporting systems of other languages. I mention this because the single hardest feature of my Beads language is the ability to time-travel debug post mortem from user-submitted dumps. It is pretty easy on a huge development machine to support time-travel debugging, but to do that in a shipping product in the customer's hands, that is something you don't see often. It is the intermittent errors that are the hardest to debug in my experience, and recreating the exact conditions of the user can often be impossible. I think the biggest scandal in computers today is not the size of the code bases, because with enough sweat and blood you can get code pretty clean; what is embarrassing is that all the big companies have bug reports numbering in the hundreds of thousands if not millions of open cases that will never get fixed because the staff "cannot duplicate".
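Going back to the definition/implementation split for a second, the recompilation win is easy to sketch. Here is a toy Python version, where a module's definition and implementation are just strings and a dependent records the hash of the definition it compiled against (all the names here are invented for illustration):

```python
# Toy version of Modula-2-style separate compilation: dependents only
# need recompiling when a module's *definition* (public interface)
# changes, never when only its implementation does.
import hashlib

def digest(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

class Module:
    def __init__(self, definition: str, implementation: str):
        self.definition = definition
        self.implementation = implementation

def needs_recompile(seen_definition_hash: str, module: Module) -> bool:
    # Dependents record the hash of the definition they compiled against;
    # only a definition (interface) change forces them to recompile.
    return seen_definition_hash != digest(module.definition)

board = Module("PROCEDURE Move(x, y: INTEGER);", "(* naive version *)")
seen = digest(board.definition)

board.implementation = "(* faster version *)"  # internals changed
print(needs_recompile(seen, board))            # False -> just relink

board.definition += " PROCEDURE Undo;"         # public interface changed
print(needs_recompile(seen, board))            # True -> dependents recompile
```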
🙏🏼 1
👍 1
Here is an example of a bug I recently encountered, where Chrome doesn't display the Unicode chess character for a white piece correctly, because even when instructed not to promote it to an emoji form, it does so anyway. I looked it up, and this bug has been outstanding in Chrome for over a year. Companies just can't seem to get their software fully correct. There's always a huge backlog of feature requests and bug reports, and years can go by with obvious errors lingering. This is the scandal. The hardware guys don't have this problem.
white pawns are drawing as emoji incorrectly
y
When you have strict rules, people will hack them, aka "game the system". An anecdote from 10 years ago: someone filed a ticket to request that the Python API for an internal service I was maintaining be made more Pythonic, which I guess it indeed wasn't. To motivate his case he linked to 3 Pythonic wrappers for the API in the Google code base. This was quite fishy in my eyes: if someone had already made a Pythonic wrapper, why did the other two write their own rather than use it? So I checked, and indeed those 2 were pieces of code written to obtain the coveted internal language proficiency certification called "Readability". While that certification was easy to get for many, due to its conditions it was challenging specifically for those tasked with maintaining existing production systems, so they needed to write unnecessary duplicated code to get the certification. This is a small example out of many of how the system incentivized inefficiency and work of lesser quality.
👍 1
k
@Edward de Jong / Beads Project's comment about Modula2 (which I remember fondly as well) reminds me of something I have been looking for for a while: a comparative evaluation of software architectures that different languages and toolsets end up encouraging in practice. Has anyone ever seen something like that?
s
@Edward de Jong / Beads Project you're assuming the primary function of large companies is some sort of idealized engineering efficiency (and that shared components and small code bases would realize that...) and not primarily financially and politically motivated
At big tech companies, employees are rewarded when they duplicate engineering work in a manner that is efficient for the business and/or efficient politically
That means impressing/appeasing someone higher up than you, and potentially other teams that are in alignment; but harming teams that are in competition with you, by making it harder for them to share, can also be advantageous
I'm not saying it's not fucked up, but you can't examine any large tech company's engineering practice and ignore the business incentives that got us here
Much of this starts with hypergrowth-preferring VC investment, but large multinational corporations are generally "inefficient" when that inefficiency can lead to some kind of economic success
I am not arguing against you btw, I'm mostly in agreement
I think it's worth examining how large companies get to where they are though
e
I have long suspected that COBOL was selected over the mostly superior FORTRAN by the employees of big corporations because it afforded more billable hours. Many a computer company lived on consulting fees earned programming in a supposedly easy-to-maintain language like COBOL. COBOL was verbose and very annoying. It did have BCD arithmetic and a convenient number-formatting syntax, but overall it was inferior. The same kind of decision was made to elevate Java to the #1 slot, when it is also a ponderous, ugly language. If your managers have no clue as to the quality of your work, they may indeed judge you "by the pound", and the more verbose the language, and the more copy/paste you perform, the higher your apparent productivity. The inability to measure and judge accurately the code quality of programmers is an interesting area. It is pretty easy for anyone to recognize good singing, dancing, and painting. Poetry is a great deal more subjective. I think programming is a great deal like mathematical poetry. And people f*cking hate poetry.
👍 3