Wednesday, September 19, 2018

A German roofer working on a cathedral found a message in a bottle, written by his grandfather - The Washington Post

A German roofer working on a cathedral found a message in a bottle, written by his grandfather


An 88-year-old message in a bottle was found under the roof of the cathedral vestibule in Goslar, Germany. (Julian Stratenschulte/Picture-alliance/DPA/AP)

— On March 26, 1930, four roofers in this small west German town inscribed a message to the future. "Difficult times of war lie behind us," they wrote. After describing the soaring inflation and unemployment that followed the First World War, they concluded, "We hope for...

The Human API Manifesto - Study Hacks - Cal Newport

The Human API Manifesto

September 18th, 2018

The Bezos Mandate

In 2002, Amazon founder and CEO Jeff Bezos sent a mandate to his employees that has since become legendary in IT circles. It reads as follows:

  1. All teams will henceforth expose their data and functionality through service interfaces.
  2. Teams must communicate with each other through these interfaces.
  3. There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
  4. It doesn't matter what technology they use. HTTP, Corba, Pubsub, custom protocols — doesn't matter.
  5. All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
  6. Anyone who doesn't do this will be fired.
  7. Thank you; have a nice day!

This directive, which some informally call Bezos's "API Manifesto," transformed Amazon.

To be sure, transitioning to these formal APIs made life harder in the short term for its engineers. It was also expensive, both in terms of the money spent to develop the new interfaces, and the time lost that could have been dedicated to projects producing immediate revenue.

But once the company embraced Bezos's mandate, it was able to operate its systems much more efficiently. It also enabled the launch of the public-facing Amazon Web Services, which now produces a much needed influx of profit, and allowed Amazon's web store to easily expand to encompass outside merchants, a key piece in their retail strategy.

The impact of the API Manifesto has since expanded to the IT industry as a whole. From start-ups to massive organizations, the idea that information systems are more valuable when interacting through clearly specified and well-supported APIs has become common.

Last week, for example, the cofounder of an IT firm told me the story of how he helped a large financial services firm implement an API for a set of services that were previously accessed in an ad hoc manner (think: batched FTP).

It cost the firm a little over a million dollars to make this transition. He estimates it now helps them earn an additional $100 million in revenue each year through a combination of cost savings and the new customer acquisition applications enabled by providing a clearly specified and accessible interface for these services.
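
To make the idea concrete, here is a deliberately tiny sketch (mine, not Amazon's or the financial firm's; the route, team, and data are invented) of what "expose your data through a service interface" means in practice: other teams call a documented endpoint over the network instead of reading the owning team's data store directly.

    # Hypothetical illustration: an "orders" team exposes order data through a
    # small service interface instead of letting other teams query its database.
    from flask import Flask, abort, jsonify

    app = Flask(__name__)

    # Internal data store -- off limits to everyone outside this team.
    _ORDERS = {"1001": {"customer": "acme", "total_cents": 4250, "status": "shipped"}}

    @app.route("/orders/<order_id>", methods=["GET"])
    def get_order(order_id):
        """The only sanctioned way for other teams to read an order."""
        order = _ORDERS.get(order_id)
        if order is None:
            abort(404)
        return jsonify(order)

    if __name__ == "__main__":
        app.run(port=8080)

Because the interface is explicit and documented, it can later be hardened and exposed to outside developers, which is the "externalizable" requirement in point 5 of the mandate.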

On Attention Capital

When I heard about the API manifesto, a provocative thought popped into my head: could these same underlying ideas apply to communication between people?

To provide some background to this question, let me first remind readers that my attention capital theory argues that the most valuable capital resource in a knowledge work organization is the brains of its employees. Or, to be more specific, the capacity of these brains to focus on information, process it through neurons, and then output more valuable information.

Success in knowledge work is about getting the best possible return on this attention capital, much as success in the industrial sector is about getting the best possible return from physical capital (factory equipment, trucks, shipping containers, etc.).

I believe that many knowledge work organizations currently get sub-standard returns on their attention capital because the workflows they deploy — which are often unspecified and emerged haphazardly — depend too heavily on constant, unstructured communication, which conflicts with the way the human brain operates, reducing these brains' capacity to think deeply and produce valuable output.

The natural follow-up question to this observation is to ask what work would look like without constant unstructured messaging. It's this follow-up question that brings me back to APIs…

The Human API Manifesto

Imagine if a Jeff Bezos figure at a major knowledge work firm sent a mandate to his or her employees that read something like this:

  1. Our work will be seen as a collection of processes that take in specific inputs and produce specific outputs. Individuals are associated with the processes that they support.
  2. Each process has well-defined and well-documented communication protocols specifying how information comes into the process and how it leaves. It also has protocols specifying how the individuals associated with the process internally coordinate, including when and how this coordination occurs. We call these protocols human APIs (hAPIs).
  3. There will be no other form of inter-personal communication allowed: no generic email inbox or instant messenger channel that can be used for any purpose, and no casually dropping by someone's office to make a request. The only communication is through hAPIs.
  4. Different hAPI specifications will include different technologies. Some will include no technologies at all. The details of the tools used to implement these protocols are less important than the protocols themselves.
  5. If a particular request or notification seems too minor to justify its own process, or hAPI within an existing process, consider eliminating it. The need to specify hAPIs will help our organization focus more relentlessly on activities that create real value, and help eliminate minor asks that are convenient in the moment, but end up reducing the return on our attention capital in the long term.
  6. Anyone who doesn't do this will be fired.
  7. Thank you; have a nice day!

Brilliant or Blunder?

A mandate like this would turn out to be either brilliant or a colossal blunder, which is exactly why it intrigues me.

The arguments in favor of this being brilliant include the observation that well-crafted protocols can minimize the cognitive overhead required to keep track of the different projects and tasks on your plate. Instead of drowning in an ever-filling pool of messages, you can instead work to satisfy a clear set of expectations and optimized action.

The structured nature of this communication also eliminates the requirement to constantly monitor general-purpose communication channels, which helps minimize attention residue — generating a non-trivial boost in your cognitive capacity. As I argued in Deep Work, if you can avoid constant "quick checks" of inboxes and channels, you can learn hard things faster, and produce higher quality output in less time.

In addition, well-documented hAPIs make it easier to integrate new hires or seamlessly hand off responsibilities when someone is sick or away on vacation, enabling a much more flexible deployment of an organization's attention capital.

And as hinted above, the specificity required to implement hAPIs forces an organization to be transparent about all the ways their attention capital is being tapped, supporting a move toward long term value production and away from short term convenience.

On the other hand, there are many reasons to suspect this human API approach could prove disastrous if embraced.

For one thing, it would be a massive pain to have to reduce the messy ambiguity of the typical knowledge work organization to a set of clearly specified processes and hAPIs.

Even once completed, this approach might not be nearly agile enough to keep up with unexpected needs or demands, creating lots of hard edges at which projects are stalled or opportunities missed.

It's also possible that my attention capital theory is wrong. The current trend in workflows within the knowledge sector is to prioritize flexible coordination over maximizing cognitive output. Maybe this is actually the best thing to do.

Finally, it's worth acknowledging the practical difficulty of getting an entire organization to actually buy in to such a radical transformation (for more on this, see Sam Carpenter's Work the System).

An Ambiguous Conclusion

Something like the human API approach might be the key to evolving the knowledge sector to a new level of effectiveness. Or it's stupid.

I can't quite tell.

If you've had any experience with this type of approach (for better or for worse) in your own organization, I'd love to hear about it (interesting@calnewport.com). If the idea sparks a strong reaction in you (for better or for worse), I'd love for you to elaborate in the comments.

Monday, September 10, 2018

I Find Bugs too Boring to Write by Arlo Belshee – Deconstruct

Transcript

Refactoring (Martin Fowler et al.)

(Editor's note: transcripts don't do talks justice. This transcript is useful for searching and reference, but we recommend watching the video rather than reading the transcript alone! For a reader of typical speed, reading this will take 15% less time than watching the video, but you'll miss out on body language and the speaker's slides!)

[APPLAUSE] Hi. So I'm a legacy code mender. [LAUGHS] And by that, I mean I'm a person who finds bugs too boring to write, and so has dedicated my life to not only not really writing very many of them myself, but changing code so that other people don't write them, either. So I want to talk about that a little bit. There's a table. I can go in front of the table, because I don't need this for a minute.

So I'm going to tell a story of a couple of different companies. So one is Hunter Technologies. It's an interesting company down in the Bay Area. They've got a bunch of teams. This is where mob programming happened to be invented.

But I was talking with them about five, six years ago as they were starting to work in mobs. And they were telling me about how they'd had this dramatic change in defect rate in the software. And what is that? And so they started talking about bugs, not in terms of bugs that were found, or found by QA, or found on the factory floor. They do factory floor automation. But it was the number that weren't written at all.

And so I asked how many they'd had. And they said, well, this year there was one. And then I asked next year. And they said, well, it's only halfway through the year yet. So still zero. Over the course of five years, they had three. Now, technically, that's not bug zero. But I'm willing to give them credit. [LAUGHS] So that's one company, and one end of the spectrum.

And then there are many, many, many, other companies. And I've worked on a lot of code bases, everything from small four or five million lines of code to some of the larger 250 million lines of code sorts of code bases, anywhere from a dozen developers to 4,000 or 5,000 developers plus managers, and everybody else who's in the department, right? And their defect rates are not quite the same.

And I don't know about you, but I find the life experience of living in teams of these two kinds to be qualitatively different. And when my life is about finding bugs, triaging bugs, explaining why this bug is some other team's fault, and therefore I don't have to do anything about it, management decisions are about reporting bugs, there's no excitement there. There's no delivery of value to the customer. It's not fun.

So when I do work with software, I want to bring the fun back, and the fun comes from really working on the real stuff. And bugs get in the way. So I find bugs too boring to write. And that has caused me to really look at why do we write bugs, because I'm a developer. I have written bugs. I have never actually intended to write a bug. Has anyone in this room intended to write a bug? Wow, brave soul. [LAUGHS] We've got a few brave souls there.

So when you did that, were you actively testing your testers? Yes, OK. [LAUGHS] That is a good reason to write a bug, yeah. But that's about it, right? And yet, every bug is hand crafted, artisanal, made with care, one-off produced by some of the more brilliant and imaginative minds that we have out there.

So why do we make all those mistakes? Looking at it, I found there are a couple of reasons, and not many. But we've got some guides who can help us find these reasons. We will be in code in a little while. And I want two faces burned into your memory so that you can call out when those happen. So one is Inigo Montoya.

So the number one reason that developers write bugs is they're looking at some section of code they need to modify in order to change the behavior of the system, and this code does not do what they think it means. [LAUGHS] Yeah. When code is locally illegible, it's basically impossible to modify it without introducing a bug. As soon as my understanding of the code differs from the original artist's understanding, differs from the customer's understanding, differs from the computer's understanding, then it's just a matter of time.

So the first thing that we need to do to eliminate bugs is to make code legible. And actually, taking this to an extreme, I was looking at some research that was done in the early '80s on sources of bugs. It was actually intended to be an argument for why C was a much more effective language to program in than the things prior to it.

And the argument they tried to make was that it lets people work at a higher level, and you should see fewer bugs. And so in the process of doing that, they had to look at all these bugs, and see what correlated with it, and so on. And lines of code was highly correlated with bugs. And so they said C is better, because it was fewer lines of code than almost anything except COBOL at the time to do similar sorts of jobs.

But they found there was an even stronger correlation. The number one predictor of a bug in those programs is something that we have eliminated now. And that was inconsistent white space. You are more likely to cause a bug by changing the characters that are not noticed by your compiler than by changing the ones that are, because as soon as you have a mixture of tabs and spaces in the same file, or the wrong number of tabs on a line, then humans read that code very differently from how the machines read it, and bug.

So this problem went away significantly as soon as we all switched to using IDEs that auto format on every edit and save, right? Tools. By the way, one of the great ways that you can make a choice to live a bug-ridden life is to use crappy tools. [LAUGHS] Your choice of tools is a choice of how many bugs you want to write, and how many bugs you want to debug over your career. And even really simple ones turn out to make a big difference.

Second source of bugs. [INAUDIBLE]. OK, so assume that we've got code that I can tell what the heck it's doing locally. And there's still another really common cause of bug. I read this code, it makes total sense. I understand it. I make a change. It's just that my understanding of this turns out to be legitimate in the one context that I was reading it in. But it's got unexpected dependencies or interactions with foreign parts of the system.

And so it turns out it can be accessed in very different contexts, and it has very different purposes, and uses, and effects. And you see spooky action at a distance. I make a change here, and something over there breaks. Why? I don't know. So this occurs any time that we have behavior that is not locally determinable, nondeterministic. So you swallow a little more than you expect you swallowed.

So those are the two most common sources of bugs out there, as far as I've been able to see and understand from poking around a lot, and even reading some research. The next most common one is communications problems. The problem that the developer thinks they're solving is not the problem the customer thinks they're solving. And that's also worth talking about, but not in this talk, because I want to code, and that one's not code.

So the last thing is when I talk about bugs, people nod along here for a little while, and they said, Arlo, but there's one more. And that is, my architecture is crap. This is a special case of that second bug. Yeah, I would like my code to not have spooky action at a distance. But remember those 10 million lines of code? They're all in one monorepo.

They are all compiled in one release. My build does take 2 and 1/2 hours to put everything together, and to run all our tests. I do have to test everything from the outside. We try and do unit tests, but you got to understand the size of a unit, right? [LAUGHS]

Yeah, so architectural flaws. So what we need is to do something that not only solves these couple of problems, but solves it in a way that can allow us to address fundamental architectural flaws throughout the system. And so these can be things like one project that I was working on, the project when we started, had a million and a half lines of ColdFusion, which was an interesting technological choice, given that no one on the team knew ColdFusion. And it was written like no one really knew ColdFusion. There was a lot of SQL in it.

And so we decided to make a shift. And we were going to move this to ASP.NET MVC, and C#, and all the goodness. And a million and a half lines to do that conversion. Well, we were doing medical claims processing. And it turns out people don't stop getting sick or hurt just because you want to change technologies.

So it's really a good idea to keep being able to process claims. And furthermore, the government doesn't stop changing regulations, nor do doctors and patients stop finding new ways to commit fraud, just because you want to change technology. So we needed to be continually releasing the whole time. So we did our shift gradually from the one language to the other, from a procedural style to MVC style, releasing four or five times every day the whole time through, and adding new features along the way.

And here's how. So this code base is what you get if you're a loud mouth on the internet. So I was talking with people on the internet. And I have a perhaps well-known distaste for mocks. I think mocks are evil. With good evidence, I think they're evil.

And so after one of many loud conversations about this, someone said, all right, but they're a necessary evil. I'll give you some code. Try and test this without using any mocks, or fakes, or anything. So that's what this program is. And it is not intended to be perhaps in the optimal state for testing. So it's one simple 80 line function. That's the whole program. It works, trust me. Oh, by the way, there are three bugs in this program. Did anyone spot them?

[LAUGHTER]

Yeah. [LAUGHS] I didn't either, for a while. So I went ahead and I did this on my own once about two months ago. And I haven't looked at it since. So I figured I'd do it live in front of you guys. And let's see where it goes. So our objective is we have this piece of code, which presumably, I'm going to make some future change in.

And we'll just go ahead and make it easy. We'll say that the customers have identified-- there's some funny thing that happens with critics being overly trusted with some particular sets of movie reviews. Now, if you don't understand what the hell I just said, that's fine, because neither did I. I know nothing about this domain as we start.

So we want to find and fix that bug. What do we do? [LAUGHS] Run it? OK. Actually, I don't really know how, because you see, as I start to run it, at least when I first got it, this program linked against a version of SQL Server that was old enough that I couldn't get an installer for it. [LAUGHS] So find the bug. [LAUGHS] By the way, you can't run the program.

Anyone have another idea? It does compile, so it's not bad.

[INAUDIBLE]

Yeah, why don't we do some of that? So I already did the first couple of commits here. I put it into a solution, and I auto formatted the document. So at least white space is consistent, and those sorts of things. But yeah, why don't we get it to be readable?

So the first thing that I notice is functions, especially legacy functions, tend to have a pretty standard structure. First, you get the wrong arguments. You have a useless name. In this case, Main. That's real hopeful. That conveys a lot of information. And then you get the useless arguments. And then you have a chunk of code, which transforms from the wrong arguments to the data you wish you had.

Then once you have the data that you wish you had, you do some amount of work. I don't know what the work is in this case. And then usually, at the end of doing that work, you prepare some information that you give back to the caller. In this case, it doesn't look like we do much of that. It looks like all the work is done by side effects. So that'll make it even more fun to test.
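
(Editor's note: the code on screen is not reproduced in this transcript. The following is a minimal Python sketch of the shape just described; the names are invented and the talk's actual code is C#.)

    # The typical legacy-function shape: an unhelpful name, raw arguments, a
    # block that turns them into the data you wish you had, then work whose
    # results leave only as side effects.
    def process_rating(movie_id, critic_id, stars):
        # stand-in for the real work, which in the talk happens against a database
        print(f"critic {critic_id} rated movie {movie_id}: {stars} stars")

    def main(args):
        # transform the raw arguments into the data we wish we had
        movie_id, critic_id, stars = (int(a) for a in args[:3])
        # the interesting work, buried in the middle
        process_rating(movie_id, critic_id, stars)
        # nothing returned -- the results leave as side effects

    main(["12", "7", "4"])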

So let's just go ahead and get started. The interesting part of any function that has that structure is generally in the middle. So usually, what I'll do is I'll go to the bottom of the function, like I just did, and then I'll scroll up to find the first block, first large block. And that's this one. And it's nice, because someone even tagged it. So let's just start making it readable. OK. Can anyone tell what Main does now?

[LAUGHTER]

Maybe I don't know what it does, but I have a good guess. All right. So what am I doing here? Any thoughts?

You're refactoring.

I'm definitely refactoring, yeah. [LAUGHS]

Breaking it into smaller units.

Yeah, I'm breaking it into smaller units. But there's a really important thing that I'm doing. It's not just that smaller units are easy to understand.

Giving concepts to names.

[INTERPOSING VOICES]

Bingo. Giving names to concepts, yeah. And so the console read line was a particular one where I guessed what that concept was. And I'm giving it a name. Now, you could see also from this that I might not be completely confident in my belief that that's the right concept here.

So what I'm trying to do here is I'm working through creating names and creating concepts along with them, but it's an iterative process. So I've just done a couple of things. That's way more code than I ever want to have not in source control. So I'm about to commit to source control. What's the probability that I broke something?

Low.

Low, OK. Is it low enough? Do I understand this code?

No.

No. [LAUGHS] Do any of you understand this code?

[LAUGHS] No, OK. So pairing and mobbing, none of that's going to help me here, right? [LAUGHS] So what I'm trying to do here actually is read the code. So recall the whole point of this is to reduce the number of bugs that I write. So when I'm reading code in a normal traditional style, what's the probability that I introduce a bug?

Zero.

Zero, yeah. If I don't change anything, then it wasn't me that introduced the bug. Maybe an operating system went down or something, but clearly, it wasn't my code edit, right? So what I need to do is have the same probability. What I really want to do is develop a way of working that is uniformly better than my old traditional way of working.

So every type of work that I do has no more risk than it did before, no more cost than it did before, and I can drive the total cost and risk of software development down, and spend a lot more time doing things for my customers or playing foosball, and a lot less time fixing bugs. So is low good enough? No. It's not zero.

So that's an interesting question, because then the question is, so how do you get it to low enough? Did I run any tests? Nope? Am I going to run any tests?

[INAUDIBLE]

Yeah, actually, I do have NCrunch running. And it did auto run a test. Do you want to see what that test is? This is a fantastic test. It tests all the things by verifying nothing special. [LAUGHS] How much confidence do you have in the test suite here?

Zero.

Zero, yeah, approximately, yeah. So it does verify that NCrunch is working. That's good. [LAUGHS] Yeah, so I need to have confidence that I didn't break anything, that I'm basically zero probability of breaking things. And I don't have tests. That's OK, because actually, the level of proof that I need of 0 is higher than any test we could find.

So what I need is not only did I not introduce a bug. What happens if this system has been used for 25 years out in production by some companies that aren't me, and they've extended it, hooked into various APIs and called them, and they paid some attention to the API documentation, but mostly they just poked at it and saw what worked, and what the behavior was, and they wrote code that integrates against it. And they're now running a multi-billion dollar business off of that, and they lost the source code about 10 years ago. What happens if I accidentally fix a bug?

Yeah, I can't afford that, right? So I need bug-for-bug compatibility. I need to be able to guarantee 100% that not only did I not accidentally introduce a bug, I didn't accidentally fix a bug that I don't even know exists. All right, so if I'm going to prove that I didn't fix something that I don't know exists, does my test suite have any use? No. Everything in the test suite is something that I at least, at one point, knew existed.

So I need a higher standard of proof. And I actually do have the higher standard of proof right now. It just happened so fast that probably very few of us saw it. What do I have that's giving me enough evidence that I feel confident I'm actually 0% probability I changed behavior here?

Tools.

Tools, yes. ReSharper in particular, in this case. The refactoring tools, yeah. And if I don't have these tools-- by the way, part of my day job, I do a bunch of this work in C++, too. The tools in C++ are not quite up to the same standards of tools in Java or C#, but we need to have the same confidence.

So we've created there a bunch of recipes that lean on the compiler, because these tools and the compiler have something in common. They're using static analysis. They're analyzing the program to know for sure what changed, and what didn't. So what I need to do is create a way of working that leans on whatever static analysis or other proof systems I have available, and guarantees that each transform is perfectly safe.

Now, this is important for a number of reasons. One, because zero, it's a nice number. But another is if we look at the tree of the last time I did this, I worked for 11 hours, which is 244 commits. And I was going a little slow, but not too bad. And I did get lost in the weeds. I didn't have a partner, so there was one period where for 45 minutes, there were no commits. And if I'd had a partner, eight minutes into that, he'd slap me around. [LAUGHS]

So if I'm doing 244 of these transforms, and I have a very small probability of error, the probability of error at the end is much higher. And the place where I found the bug was somewhere around here. It was somewhere in there. I've forgotten exactly. But it has 180 commits in, 160, 180, somewhere in there.

I need to know for sure that I didn't just introduce that bug, that that bug was actually there in the original. Otherwise, I'm not going to be able to have the right conversation with my customer to find out whether they're depending on it or not. So static analysis, tools, it's the thing.

So let's carry on for a little bit more. Let's look at these names. Process rating. Well, first, this one. Is that name better than what was there before? Yeah. Is it any worse than what was there before? No. Let's look at the other one. Process rating. Is that better than what was there before?

It's the same? So what do you mean by it's the same?

Because the other code said process rating, and it had that [INAUDIBLE].

Right, so the previous code had a comment saying process rating, and then had the block. Yep. So it's a little bit better in that at least the block is separate and I can see what data is flowing in and out of the block. And that's good. But the name is equivalent. Is it any worse?

Could be wrong.

Could be potentially wrong, yeah. So I've just taken something that was in a comment. How much do we trust comments?

[LAUGHS]

Right. And I've put the same text in a method name. How much do we trust method names? More than we trust comments. So if I want to be able to trust method names in this system, I can't take information that was untrustworthy and now pretend it was trustworthy. In that way, I have made it worse. I've potentially fooled people that follow.

[LAUGHTER]

Now is it any worse? So this is a mechanistic sequence that I call naming as a process. And it goes through several steps. And the first step is usually take something missing, it doesn't have a name at all, and name it Nonsense. I find a method. It's a few thousand lines long. I don't want to read it, but I know it's one block, and it's not related to the code that I want to do.

I'll name it Applesauce, because unless I'm working at Tree Top, I know for a fact Applesauce has nothing to do with my domain, and everyone in the company will not mistake that-- [LAUGHS] --for anything sensible, right? It's obvious nonsense. At least I've taken it from not so obvious nonsense to obvious nonsense. It's better.

Then the next step is honest. And the problem with Process Rating was that it was neither obvious nonsense, nor was it honest. Process Rating I Think is honest. Make sure that I'm on the right one. Yes, I'm on the right one. What are you? Go away. OK.
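
(Editor's note: a small Python illustration of the naming progression just described; the names are invented and the talk's actual code is C#.)

    # Naming as a process: the same extracted block is renamed in small, safe
    # steps as understanding grows.

    def applesauce():
        """Step 1 -- obvious nonsense: clearly not a real domain name yet."""

    # ...after reading a little more, rename with the IDE's rename refactoring:
    def process_rating_i_think():
        """Step 2 -- honest: records a guess, and records that it is a guess."""

    # ...after verifying what the block actually does, rename once more:
    def process_rating():
        """Step 3 -- a name we are now willing to trust."""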

All right, so now I can continue along in this line. And in fact, what I did the last time I worked on this, I did continue along in this line for a while. And eventually, I could figure out what process rating really is. Oh, and by the way, this comment is not providing much value anymore. I could delete that comment.

And I can figure out what it does, and continue working on in that. And that's a good thing, and it's good direction. But there's another direction that's worth exploring here. And that is we talked about, well, what about when my real problem is the full system design?

So this application is designed as one procedure that runs top to bottom, and executes a sequence of arbitrary LINQ statements against the database, with most data held between statements just in whatever's in the database, and a few things in local variables. That might not be an optimal design for the ability to reason over it, or the ability to test it.

So I want to do some refactorings to start getting some better stuff there. So one of the first things that I notice as I look at this is, remember the part where I said we get input in an annoying format, and then we figure out the data that we wish we had? So it looks like the data that we wish we had are a movie ID, a critic ID, and stars. I don't know what stars is, presumably the rating in number of stars.

So we're passing around these three values. Now, any time that I see us passing around multiple values, all of which are primitives or low level things, that's three ints. Ints don't have anything to do with my domain. So therefore, not much value here. That tells me that I need a domain concept. Let's extract class from parameters. Anyone know what this is?

Rating.

Rating. Sure, let's call it a rating.

[LAUGHTER]

OK. And I now have this rating, I think, class. Well, that's nifty. I certainly don't want that in the same file here. What? You're just not going to auto fix it for me? OK. OK. I don't know what you are, but you are going away. Die, die, thank you.

So I've just created this rating class. Now, that's cool. What did it-- It was complaining that it couldn't write to some of these things. So hey, look, ReSharper has a bug.

Have you tried their tools?

[LAUGHTER]

It's a really important question there, because the truth is that our tools are imperfect, all right? But they have the advantage of being consistently imperfect.

[LAUGHS]

Myself, I am inconsistently imperfect. So I can learn what my tool does safely, and what it doesn't do safely. And for the things that it doesn't do safely, I can file a bug. And they'll actually fix that, whereas filing a bug on myself of, you should be a little better at making this mistake less often. Good luck fixing that. And it also wants to know our constructor. Sure, great. Initialize with garbage, and allow me to do things.

Ah, there's what actually happened. Didn't see that until we got here. So we screwed this up when we did our rename. So when we called it Rating, we introduced a collision on the class name Rating with something that was in the database. Turns out, that name was already used in our domain. So by going through the named rating, I collapsed two ideas on top of each other, and it gets all screwed up.

So fortunately, that's why we have this, because I just wasted an entire five minutes of work there. But I'm back to a known good state. So we can't call it rating. Instead, I'm going to call it Rating I Think, because that is a unique name.

Now, the first time that I went through this, I happened to call it Critique. And so I got lucky, and everything was good. Now, if I was paying attention to my tools, NCrunch actually was telling me the whole time that the damn code didn't compile.

OK. Cool. Now I've got class. OK, Stop Build. I'm surprised it succeeded, given that there's no internet in here. So NuGet's not going to work very well. OK, so now I see some other issues with this code. How many people like line 27?

None.

[LAUGHS] None. And none of us like line 27. OK, so let's make that a little better. It seems like it's the new rating, the one we're going to add. And now we've got this method down here, Process Rating I Think. As a static method that takes one of those, I sure wish that instead, it was-- really? OK. Presenter Mode has put things on top of other things. What do I want? I want to make instance method, which I'm not finding.

Second from the bottom.

Second from the bottom? Thank you. There we are.

[INAUDIBLE]

Thanks. OK. Poof. And now up there. So what's happening here is I'm discovering, as I go, little pieces of the domain. I still don't understand what the program's doing. I don't need to understand what the program's doing. I'm discovering what the program is doing, all right? And any design choice, which a prior programmer or I have made, is amenable to change, as long as I've got enough refactorings.
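
(Editor's note: a Python sketch of the two refactorings just performed, extract class from parameters followed by make instance method; the class name mirrors the talk's "Rating I Think", but the code itself is invented.)

    from dataclasses import dataclass

    # Before: three loose ints travel together through the program.
    def process_rating(movie_id, critic_id, stars):
        print(f"critic {critic_id} rated movie {movie_id}: {stars} stars")

    # After "extract class from parameters" plus "make instance method": the
    # values that travel together become a named domain concept, and the work
    # that uses them moves onto that concept.
    @dataclass
    class RatingIThink:
        movie_id: int
        critic_id: int
        stars: int

        def process(self) -> None:
            print(f"critic {self.critic_id} rated movie {self.movie_id}: "
                  f"{self.stars} stars")

    RatingIThink(movie_id=12, critic_id=7, stars=4).process()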

And so then there's the question of, are there enough refactorings? Turns out, if you look at all the ones that are available in Fowler's book, there seem to be: you can go from pretty much any arbitrary design choice that someone has made to a new design in a sequence of, well, it might be hundreds or thousands of refactorings. But if each one's safe, I know I'm good.

So this gets us back to that. I can address architecture flaws. I just don't address architectural flaws at once. But when I'm working this way, how long is it before I can ship this code?

Now.

Now, yeah. When was the last point I could ship this code?

[INAUDIBLE]

10 minutes ago, because I made a mistake. [LAUGHS] Right? Yeah. So I can attack an extremely aggressive architectural flaw, the kind that take a year and a half to fix. And I can release every five or 10 minutes, the whole time that I'm doing it, which means when developers are writing bugs, they're writing bugs because the code is setting them up to write bugs.

And yet, I just said any flaw that's in there, including an architectural concern, we can now fix and chip away at without introducing bugs. That's what allows us to dramatically reduce the number of bugs that both I write, and that the future person writes. And the question is, can you afford it? We didn't understand this code. Do we understand the code a little better now? A little bit. I understand it a little bit better.

How much time do you think a normal dev spends in a typical day trying to read code to understand what it is, or navigate and scan around code to find what code they should read? Percentage of their work day. Any guesses?

90%.

90%, 80%, 50%. Yeah. So the numbers I've seen do vary. There was an Eclipse study where they turned on their telemetry, and watched what people were doing. And they found the amount of time that people were just Alt-tabbing between files, and scrolling around in them, and times when people had one piece of code on the screen for a long time, and then would click on things, and so on. [LAUGHS]

And they found that to the best of their knowledge, they saw that developers spent about 70% of their time reading or scanning code, and then about 15% of their time testing or debugging, depending on the developer. Take your pick. And then they spent about 10% of the time actually writing and editing code, and then other time for other things. So if we were going to optimize the developer, where's the place where we would make the biggest savings?

Readability.

Readability. It's in reading the code. So if I'm facing a 10,000 line method, and I try and read it, how long is it going to take before I understand that thing well enough that I can edit it?

Never.

Never. [LAUGHS] Yes. If I drop it to only 150 lines, a day, maybe a little less. What we're doing here is we just have one insight at a time, and we record it. And that takes the things from our mental field, and puts them in our visual field, and we get one idea at a time, and we build on that. This code, which you could not understand before, is getting pretty understandable. It's still got this block at the top. But after that, it's pretty straightforward.

So what we do here actually decreases the time it takes to read code. So we can pay off technical debt, architectural debt of the kind that is causing us to write bugs, in a way that decreases the amount of time it takes us to write code, and without introducing bugs as a side effect. This is why disciplined refactoring and really good tools are really, really, really handy. [LAUGHS]

And the message that I want to get out to all the other people out there who would like to be legacy code menders and work in a system that has no bugs is this: you can do it. It's a matter of working in really tiny high-discipline steps, and composing those at read time so that by the time you go to write and edit any code, it's all easy to work with. Thank you very much.

[APPLAUSE]

Sunday, September 9, 2018

When Bayes, Ockham, and Shannon come together to define machine learning

A beautiful idea, which binds together concepts from statistics, information theory, and philosophy.

Editorial Associate "Towards Data Science" | Sr. Principal Engineer | Ph.D. in EE (U. of Iilinois)| AI/ML certification, Stanford, MIT | Open-source contributor

Introduction

It is somewhat surprising that among all the high-flying buzzwords of machine learning, we don't hear much about the one phrase which fuses some of the core concepts of statistical learning, information theory, and natural philosophy into a single three-word-combo.

And it is not just an obscure and pedantic phrase meant for machine learning (ML) Ph.D.s and theoreticians. It has a precise and easily accessible meaning for anyone interested in exploring it, and a practical pay-off for practitioners of ML and data science.

I am talking about Minimum Description Length. And you may be thinking what the heck that is…

Let's peel the layers off and see how useful it is…

Bayes and his Theorem

We start (not chronologically) with Reverend Thomas Bayes, who, by the way, never published his idea about how to do statistical inference, but was later immortalized by the eponymous theorem.

It was the second half of the 18th century, and there was no branch of mathematical sciences called "Probability Theory". It was known simply by the rather odd-sounding "Doctrine of Chances" — named after a book by Abraham de Moivre. An article called "An Essay towards solving a Problem in the Doctrine of Chances", first formulated by Bayes but edited and amended by his friend Richard Price, was read to the Royal Society and published in the Philosophical Transactions of the Royal Society of London in 1763. In this essay, Bayes described — in a rather frequentist manner — the simple theorem concerning joint probability which gives rise to the calculation of inverse probability, i.e. Bayes' theorem.

Many a battle has been fought since then between the two warring factions of statistical science — Bayesians and Frequentists. But for the purpose of the present article, let us ignore the history for a moment and focus on a simple explanation of the mechanics of Bayesian inference. For a super intuitive introduction to the topic, please see this great tutorial by Brandon Rohrer. I will just concentrate on the equation.
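
For reference, the equation is Bayes' theorem in its standard form, with the three terms named as in the next paragraph:

    P(\text{belief} \mid \text{data}) = \frac{P(\text{data} \mid \text{belief}) \, P(\text{belief})}{P(\text{data})}
    \qquad \text{i.e.} \qquad
    \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}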

This essentially tells you that you update your belief (the prior probability) after seeing the data/evidence (the likelihood), and assign the updated degree of belief to the posterior probability. You can start with a belief, but each data point will either strengthen or weaken that belief, and you update your hypothesis all the time.

Sounds simple and intuitive? Great.

I did a trick in the last sentence of the paragraph though. Did you notice? I slipped in a word "Hypothesis". That is not normal English. That is formal stuff :-)

In the world of statistical inference, a hypothesis is a belief. It is a belief about the true nature of the process (which we can never observe) that is behind the generation of a random variable (which we can observe or measure, albeit not without noise). In statistics, it is generally defined as a probability distribution. But in the context of machine learning, it can be thought of as any set of rules (or logic, or process) which, we believe, can give rise to the examples or training data we are given for learning the hidden nature of this mysterious process.

So, let us try to recast Bayes' theorem in different symbols — symbols pertaining to data science. Let us denote the data by D and a hypothesis by h. This means we apply Bayes' formula to try to determine what hypothesis the data came from, given the data. We rewrite the theorem as,
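
    P(h \mid D) = \frac{P(D \mid h) \, P(h)}{P(D)}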

Now, in general, we have a large (often infinite) hypothesis space, i.e. many hypotheses to choose from. The essence of Bayesian inference is that we want to examine the data to find the one hypothesis which is most likely to give rise to the observed data. We basically want to determine the argmax of P(h|D), i.e. we want to know for which h the observed D is most probable. To that end, we can safely drop the term in the denominator, P(D), because it does not depend on the hypothesis. This scheme is known by the rather tongue-twisting name of maximum a posteriori (MAP).

Now, we apply the following mathematical tricks (the resulting MAP expression is written out after the list):

  • The fact that maximization works the same way for the logarithm as for the original function, i.e. taking the logarithm does not change the maximization problem.
  • The logarithm of a product is the sum of the individual logarithms.
  • Maximization of a quantity is equivalent to minimization of its negative.
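
Applying these three tricks to the MAP objective gives (the derivation is standard and is written out here only for completeness):

    h_{MAP} = \arg\max_h P(h \mid D)
            = \arg\max_h \frac{P(D \mid h) \, P(h)}{P(D)}
            = \arg\max_h \left[ \log_2 P(D \mid h) + \log_2 P(h) \right]
            = \arg\min_h \left[ -\log_2 P(D \mid h) - \log_2 P(h) \right]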

Curiouser and curiouser… those negative log-base-2 terms look familiar… from Information Theory!

Enter Claude Shannon.

Shannon

It will take many a volume to describe the genius and strange life of Claude Shannon, who almost single handedly laid the foundation of information theory and ushered us into the age of modern high-speed communication and information exchange.

Shannon's M.I.T. master's thesis in electrical engineering has been called the most important MS thesis of the 20th century: in it the 22-year-old Shannon showed how the logical algebra of 19th-century mathematician George Boole could be implemented using electronic circuits of relays and switches. This most fundamental feature of digital computers' design — the representation of "true" and "false" and "0" and "1" as open or closed switches, and the use of electronic logic gates to make decisions and to carry out arithmetic — can be traced back to the insights in Shannon's thesis.

But this was not yet his greatest achievement.

In 1941, Shannon went to Bell Labs, where he worked on war matters, including cryptography. He was also working on an original theory of information and communication. In 1948, this work emerged in a widely celebrated paper published in Bell Labs' research journal.

Shannon defined the quantity of information produced by a source — for example, the quantity in a message — by a formula similar to the equation that defines thermodynamic entropy in physics. In its most basic terms, Shannon's informational entropy is the number of binary digits required to encode a message. And for a message or event with probability p, the most efficient (i.e. compact) encoding of that message will require -log2(p) bits.
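
A quick worked example of that encoding cost: a message with probability 1/2 needs one bit, while a rarer message with probability 1/8 needs three, since

    -\log_2(1/2) = 1 \text{ bit}, \qquad -\log_2(1/8) = 3 \text{ bits}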

And that is precisely the nature of those terms appearing in the maximum a posteriori expression derived from the Bayes' theorem!

Therefore, we can say that in the world of Bayesian inference, the most probable hypothesis depends on two terms which evoke the sense of length — or rather, of minimum length.

But what could the notion of length mean for those terms?

Length(h): Occam's Razor

William of Ockham (circa 1287–1347) was an English Franciscan friar and theologian, and an influential medieval philosopher. His popular fame as a great logician rests chiefly on the maxim attributed to him and known as Occam's razor. The term razor refers to distinguishing between two hypotheses either by "shaving away" unnecessary assumptions or cutting apart two similar conclusions.

The precise words attributed to him are: entia non sunt multiplicanda praeter necessitatem (entities must not be multiplied beyond necessity). In statistical parlance, that means we must strive to work with the simplest hypothesis which can explain all the data satisfactorily.

Similar principles have been echoed by other luminaries.

Sir Isaac Newton: "We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances."

Bertrand Russell: "Whenever possible, substitute constructions out of known entities for inferences to unknown entities."

Always prefer the shorter hypothesis.

Need an example about what length of a hypothesis is?

Which of the following decision trees has a smaller length, A or B?

Even without a precise definition of the 'length' of a hypothesis, I am sure you would think that the tree on the left (A) looks smaller or shorter. And you would be right, of course. Therefore, a shorter hypothesis is one which has either fewer free parameters, or a less complex decision boundary (for a classification problem), or some combination of these properties which can represent its brevity.
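
(The decision-tree figure from the original is not reproduced here; as a stand-in, the same idea in formulas, where "shorter" simply means fewer free parameters:)

    h_A(x) = a_1 x + a_0 \quad \text{(2 free parameters)}
    h_B(x) = \sum_{i=0}^{9} a_i x^i \quad \text{(10 free parameters)}

By the criterion above, h_A is the shorter hypothesis.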

What about the 'Length(D|h)'?

It is the length of the data given the hypothesis. What does that mean?

Intuitively, it is related to the correctness or representation power of the hypothesis. Among other things, it governs how well the data can be 'inferred' given the hypothesis. If the hypothesis generates the data really well, and we can measure the data error-free, then we don't need the data at all.

Think of Newton's laws of motion.

When they first appeared in the Principia, they did not have any rigorous mathematical proof behind them. They were not theorems. They were much like hypotheses, based on observations of the motion of natural bodies. But they described the data really, really well. And, consequently, they became physical laws.

And that's why you do not need to maintain and memorize a table of all possible acceleration numbers as a function of the force applied to a body. You just trust the compact hypothesis, a.k.a. the law F = ma, and believe that all the numbers you need can be calculated from it when necessary. It makes Length(D|h) really small.
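
For instance, for a 2 kg body pushed with a 10 N force, the law gives the acceleration directly, with no lookup table:

    a = \frac{F}{m} = \frac{10\ \text{N}}{2\ \text{kg}} = 5\ \text{m/s}^2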

But if the data deviates from the compact hypothesis a lot, then you need a long description of what these deviations are, possible explanations for them, etc.

Therefore, Length(D|h) succinctly captures the notion of "how well the data fits the given hypothesis".

In essence, it is the notion of misclassification or error rate. For a perfect hypothesis it is short, zero in the limiting case. For a hypothesis which does not fit the data perfectly, it tends to be long.

And, there lies the trade-off.

If you shave off your hypothesis with a big Occam's razor, you will likely be left with a simple model, one which cannot fit all the data. Consequently, you have to supply more data to have better confidence. On the other hand, if you create a complex (and long) hypothesis, you may be able to fit your training data really well, but this may not actually be the right hypothesis, as it runs against the MAP principle of having a hypothesis with small entropy.

Sounds like a bias-variance trade-off? Yes, also that :-)
Source: https://www.reddit.com/r/mlclass/comments/mmlfu/a_nice_alternative_explanation_of_bias_and/

Putting it all together

Therefore, Bayesian inference tells us that the best hypothesis is the one which minimizes the sum of the two terms: length of the hypothesis and the error rate.
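
In the Length notation used above, that statement is:

    h_{best} = \arg\min_h \left[ \text{Length}(h) + \text{Length}(D \mid h) \right]

which is exactly the Minimum Description Length idea named in the introduction.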

In this one profound sentence, it pretty much captures all of (supervised) machine learning.

Think of its ramifications,

  • Model complexity of a linear model — what degree of polynomial to choose, how to reduce the sum-of-squares residuals.
  • Choice of the architecture of a neural network — how to avoid overfitting the training data while still achieving good validation accuracy and a low classification error.
  • Support vector machine regularization and kernel choice — the balance between a soft and a hard margin, i.e. trading off accuracy against decision-boundary nonlinearity.

Summary and after-thought

It is a wonderful fact that such a simple set of mathematical manipulations over a basic identity of probability theory can result in such a profound and succinct description of the fundamental limitation and goal of supervised machine learning. For a concise treatment of these issues, readers can refer to the Ph.D. thesis "Why Machine Learning Works" from Carnegie Mellon University. It is also worthwhile to ponder how all of this connects to the No-Free-Lunch theorems.

If you are interested in deeper reading in this area:

  1. "No-Free-Lunch and the Minimum Description Length"
  2. " No Free Lunch versus Occam's Razor in Supervised Learning"
  3. "The No Free Lunch and Problem Description Length"


If you have any questions or ideas to share, please contact the author at tirthajyoti[AT]gmail.com. Also, you can check the author's GitHub repositories for other fun code snippets in Python, R, or MATLAB and machine learning resources. If you are, like me, passionate about machine learning/data science, please feel free to add me on LinkedIn or follow me on Twitter.
