Matt Cholick's Blog
Modern CI/CD Is a Directed Graph of Containers
2020-10-04
<p>
I had quite a difficult time figuring out how exactly GCP's Cloud Build works from reading docs and articles.
The marketing material describes its technical functionality poorly; I needed to write code and dive in to figure out how it really behaves. I found the lack of good examples frustrating enough that I decided to write up a post and the code to hopefully save someone else a bit of time.
My confusion came from the fact that I thought it was more than it actually is. Cloud Build really just boils down to a triggered chain of containers that are executed with persistent state mounted across steps. It does include a few nice convenience integrations into the larger platform, like auth, but there really is no magic.
</p>
<p>
My personal stuff has been running on VMs managed by Ansible for quite a few years now. This paradigm made sense when I set it up, but a lot has changed in the intervening years. My playbooks are feature-rich, covering both blue-green deployment and a full software deploy onto a newly provisioned VM. Not working in that space, though, has atrophied the skills I need to maintain a large playbook.
</p>
<p>
Containers make a lot more sense today anyway, especially with all the layers of sugar that various clouds have built on top of basic orchestration. One of the features I've missed for a while is full automation post-commit: I'd like to be able to make a small edit directly on Github and have that change automatically built, tested, and deployed. It's been possible for quite a while. The latest full server rebuild from playbooks I barely remember finally motivated me to invest.
</p>
<p>
I've been using
<a href="https://concourse-ci.org">Concourse</a>
for quite a few years, and I have become quite fond of it (once I made peace with its statelessness). For something small, though, it's a bit heavy. I also don't like having the engine I need to restore software from scratch living in the same cluster, for disaster-recovery reasons (my budget for toy projects is a single cluster). There are quite a few hosted solutions outside of Concourse that would address that need, but starting with GCP's native offering is a pretty low-friction choice, especially when the free tier would completely cover my needs.
</p>
<p>
Enter Cloud Build. I had skimmed the docs and once attended a talk, but I hadn't understood the core of it. I must have skipped over the key sentence that summarizes the tool: Cloud Build just executes standard containers, and does so in a context with some shared state. This model absolutely makes sense, but I had a different impression. Having used Concourse for so many years, as well as briefly testing out Drone, Jenkins X, and CircleCI, this is definitely the paradigm modern CI/CD systems have settled on. The containers run sequentially (with parallelization possible), steps return a non-zero exit code to indicate failure, and state is piped to the next container or otherwise stored. That's it; all the modern systems boil down to that, and differentiation is just UI and various convenience features.
</p>
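<p>
To make that claim concrete, here's a toy sketch of the execution contract in Python. This illustrates the paradigm only; it is not how Cloud Build is actually implemented, and the step definitions just mirror the build file later in this post.
</p>
<pre class="code"><code data-language="python">
# Toy model of the CI/CD contract described above: each step is a
# container run against the same mounted workspace, and a non-zero
# exit code fails the build. Not Cloud Build's real implementation.
import subprocess

workspace = "/tmp/workspace"  # persistent state shared across steps
steps = [
    {"id": "unit-test", "name": "golang:1.15", "args": ["make", "test"]},
    {"id": "build-binary", "name": "golang:1.15", "args": ["make", "build-linux"]},
]

for step in steps:
    result = subprocess.run([
        "docker", "run", "--rm",
        "--volume", f"{workspace}:/workspace",
        "--workdir", "/workspace",
        step["name"], *step["args"],
    ])
    if result.returncode != 0:
        raise SystemExit(f"step {step['id']} failed")
print("build succeeded")
</code></pre>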
<p>
While trying to make a complete pipeline example work, I found two aspects of the tool confusing. The first of those is
<a href="https://cloud.google.com/cloud-build/docs/cloud-builders">Cloud Builders</a>.
I came away from the docs and examples assuming these were a first-class concept, that there is a specific contract between a builder and the system executing it. There isn't; there is no special sauce in the builders. My suggestion is to mostly stick to other containers: for a lot of functionality, there is likely a better-maintained and better-documented container out there.
</p>
<p>
Images were the second misleading concept for me. The cloudbuild.yaml file can have an "images" key.
Using that key, though, doesn't make the image available in GCR until the pipeline run completes. That's too late for the pipelines I want to build: I expect a pipeline to unit test, build, and deploy into a cluster. The deploy step doesn't work in this scenario, because the image isn't yet in GCR for the target cluster to pull. A pipeline has to perform its own push before any step that drives that pull.
</p>
<p>
The full
<a href="https://github.com/cholick/cloud-build-sample">Cloud Build example pipeline</a> code is up on Github.
The pipeline unit-tests, compiles, lints, builds the image, pushes the image, deploys to a Kubernetes cluster, and tests the deployed workload. The "images" key in the build yaml doesn't affect how the pipeline works, but it does add a link to the built image in the GCP UI.
</p>
<dl>
<dt>unit-test</dt>
<dd>Runs unit tests using the official Golang image</dd>
<dt>build-binary</dt>
<dd>
Builds the linux binary, which will be available to subsequent steps. This and the next few steps
are executed in parallel using the waitFor value of "-"
</dd>
<dt>helm-lint</dt>
<dd>Lints the Helm chart</dd>
<dt>go-lint</dt>
<dd>Lints the Go code</dd>
<dt>build-image</dt>
<dd>Builds the docker image. Subsequent docker commands will have access to the image</dd>
<dt>push-image</dt>
<dd>Pushes the image to GCR</dd>
<dt>install-dev</dt>
<dd>Installs the software (via Helm) into a cluster controlled by the step's environment variables. This is one step where a builder did prove useful</dd>
<dt>prep-e2e</dt>
<dd>Installs the end-to-end Python tests' prerequisites. The target flag coordinates with PYTHONPATH in the subsequent step</dd>
<dt>e2e</dt>
<dd>Performs end-to-end tests via Python</dd>
</dl>
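<p>
As an aside, the e2e step's tests are plain Python unittest cases pointed at the freshly deployed workload. A minimal sketch of that kind of test follows; the service URL and /health endpoint here are hypothetical placeholders, not the sample repository's actual values.
</p>
<pre class="code"><code data-language="python">
# Minimal end-to-end test sketch. SERVICE_URL and the /health
# endpoint are hypothetical, not the sample repo's actual values.
import os
import unittest
import urllib.request


class DeployedWorkloadTest(unittest.TestCase):
    def test_health_endpoint_responds(self):
        base = os.environ.get("SERVICE_URL", "http://cbt.example.com")
        with urllib.request.urlopen(f"{base}/health") as response:
            self.assertEqual(response.status, 200)


if __name__ == "__main__":
    unittest.main()
</code></pre>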
<p>
The full build file follows.
</p>
<pre class="code">
<code data-language="yaml" class="hljs">
images: [ 'gcr.io/${PROJECT_ID}/cbt:${REVISION_ID}' ]
steps:
- id: unit-test
name: "golang:1.15"
env: [ 'GO111MODULE=on' ]
args: [ 'make', 'test' ]
- id: build-binary
name: "golang:1.15"
env: [ 'GO111MODULE=on' ]
args: [ 'make', 'build-linux' ]
waitFor: [ '-' ]
- id: helm-lint
name: 'gcr.io/$PROJECT_ID/helm-builder'
args: [ 'lint', 'deployment/cbt', '--strict' ]
waitFor: [ '-' ]
env: [ 'SKIP_CLUSTER_CONFIG=true' ]
- id: go-lint
name: "golangci/golangci-lint:v1.31"
args: [ 'golangci-lint', 'run', './...', '--enable', 'gocritic,testpackage' ]
waitFor: [ '-' ]
- id: build-image
name: 'docker'
args: [
'build', 'deployment/docker',
'-t', 'gcr.io/$PROJECT_ID/cbt:$REVISION_ID',
"--label", "org.opencontainers.image.revision=${REVISION_ID}",
]
- id: push-image
name: 'gcr.io/cloud-builders/docker'
args: [ 'push', 'gcr.io/${PROJECT_ID}/cbt:${REVISION_ID}' ]
- id: install-dev
name: 'gcr.io/$PROJECT_ID/helm-builder'
args: [
'upgrade', 'cbt-dev', 'deployment/cbt', '--install',
'--wait', '--timeout', '1m',
'--namespace', 'dev', '--create-namespace',
'-f', 'deployment/values-staging.yaml',
'--set', 'image.repository=gcr.io/$PROJECT_ID/cbt',
'--set', 'image.tag=${REVISION_ID}',
]
env: [
'CLOUDSDK_COMPUTE_ZONE=us-central1-b',
'CLOUDSDK_CONTAINER_CLUSTER=hello-cloudbuild'
]
- id: prep-e2e
name: 'python:3.8-slim'
args: [
'pip', 'install',
'--target', '/workspace/lib',
'--requirement', '/workspace/test/requirements.txt'
]
- id: e2e
name: 'python:3.8-slim'
args: [
'python', '-m', 'unittest', 'discover',
'--start-directory', 'test',
'--pattern', '*_test.py'
]
env: [ "PYTHONPATH=/workspace/lib" ]
</code>
</pre>
<p>
Finally, these are the two references that I found the most useful:
</p>
<ul>
<li>
The full build file
<a href="https://cloud.google.com/cloud-build/docs/build-config">syntax specification</a>
</li>
<li>
The documentation that describes substitution and lists
<a href="https://cloud.google.com/cloud-build/docs/configuring-builds/substitute-variable-values#using_default_substitutions">all the platform-provided values</a>
</li>
</ul>
Nine Years of 140 Characters
2017-10-01
<p>
Twitter recently announced they're
<a href="https://blog.twitter.com/official/en_us/topics/product/2017/Giving-you-more-characters-to-express-yourself.html">testing a 280 character limit</a>. There's a graph in the post showing that 9% of English tweets are exactly 140 characters. That was a surprisingly high percentage to me; it takes me several edits to hit the limit exactly, and I didn't think that many people went to the effort. I recall a few instances of looking up things like a Unicode ellipsis so that I could sneak in under the limit.
</p>
<p>
Curious how my own content stacked up against that average, I downloaded my
<a href="https://support.twitter.com/articles/20170160">Twitter archive</a>
and then wrote
<a href="https://gist.github.com/cholick/5ba94d54ce4307b36c995f746cf33b75">some Python to parse my tweets</a>.
Here are a few simple stats (excluding @replies and retweets):
</p>
<table class="table table-bordered table-striped" style="width: auto">
<thead>
<tr><th>Year</th><th>Tweets</th><th>Mean Length</th><th>Length ≥ 135</th><th>Length = 140</th></tr>
</thead>
<tbody>
<tr><td>2017</td><td>63</td><td>113</td><td>41%</td><td>21%</td></tr>
<tr><td>2016</td><td>109</td><td>113</td><td>36%</td><td>18%</td></tr>
<tr><td>2015</td><td>108</td><td>112</td><td>34%</td><td>12%</td></tr>
<tr><td>2014</td><td>123</td><td>111</td><td>35%</td><td>18%</td></tr>
<tr><td>2013</td><td>139</td><td>108</td><td>29%</td><td>15%</td></tr>
<tr><td>2012</td><td>252</td><td>95</td><td>20%</td><td>11%</td></tr>
<tr><td>2011</td><td>281</td><td>78</td><td>10%</td><td>6%</td></tr>
<tr><td>2010</td><td>211</td><td>71</td><td>3%</td><td>1%</td></tr>
<tr><td>2009</td><td>457</td><td>71</td><td>2%</td><td>0%</td></tr>
<tr><td>2008</td><td>358</td><td>78</td><td>8%</td><td>3%</td></tr>
</tbody>
</table>
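<p>
The linked gist does the real parsing; the core computation is simple enough to sketch. This version assumes the tweets.csv layout of archives from that era, so treat the column names as assumptions that may not match a current export:
</p>
<pre class="code"><code data-language="python">
# Sketch of the per-year stats above; assumes the old archive's
# tweets.csv column names, which may differ from current exports.
import csv
from collections import defaultdict

years = defaultdict(list)
with open("tweets.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # exclude @replies and retweets, as in the table
        if row["in_reply_to_status_id"] or row["retweeted_status_id"]:
            continue
        years[row["timestamp"][:4]].append(len(row["text"]))

for year, lengths in sorted(years.items(), reverse=True):
    n = len(lengths)
    print(f"{year}: {n} tweets, mean {sum(lengths) / n:.0f}, "
          f">=135: {sum(l >= 135 for l in lengths) / n:.0%}, "
          f"=140: {sum(l == 140 for l in lengths) / n:.0%}")
</code></pre>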
<p>
2008 was a <em>long</em> time ago, and I definitely used Twitter differently (as the table shows). Here are a few representative tweets:
</p>
<ul>
<li><q>I'm about to watch the best 30 minutes of animation ever created: Futurama's Roswell That Ends Well.</q></li>
<li><q>Interesting wired article:
<a href="https://www.wired.com/2008/04/ff-wozniak/">https://www.wired.com/2008/04/ff-wozniak/</a></q></li>
<li><q>The dude abides.</q></li>
<li><q>Is it obsessive compulsive to dump out a bag of skittles and sort them by color?</q></li>
<li><q>Java has tainted me - I like xml now... it almost doesn't seem too verbose.</q></li>
</ul>
<p>
The content is very random. Reading back over them, I can empathize with people advocating for the
<a href="https://en.wikipedia.org/wiki/Right_to_be_forgotten">right to be forgotten</a>. Tools like Twitter came along when I was old enough to know not to act too stupidly online, but still... those early tweets are so banal. They're thoughts that should be ephemeral, but instead they sit there, frozen in amber, until the end of time. Those early years had a very low signal-to-noise ratio.
</p>
<p>
Wind the clock forward nearly a decade, and I'm definitely making much better use of all those characters. Here are a few recent and representative tweets:
</p>
<ul>
<li><q>Deep neural network generated text-to-speech:
<a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">https://deepmind.com/blog/wavenet-generative-model-raw-audio/</a>
The audio samples near the end are amazing.</q></li>
<li><q>A sobering post by
<a href="https://twitter.com/briankrebs">@briankrebs</a>,
'The Democratization of Censorship'
<a href="https://krebsonsecurity.com/2016/09/the-democratization-of-censorship/">https://krebsonsecurity.com/2016/09/the-democratization-of-censorship/</a>
Great to see
<a href="https://twitter.com/google">@google</a>
step in w/ Project Shield.</q></li>
<li><q>Sad that I discovered
<a href="https://twitter.com/LuckyPeach">@LuckyPeach</a>
only months before its demise.</q></li>
<li><q>Vendoring & GOPATH create enough pain & annoyance that every time I step away from Go & return, I seriously consider dropping the language.</q></li>
<li><q>Crazy bimodal distribution of reviews:
<a href="https://fivethirtyeight.com/features/al-gores-new-movie-exposes-the-big-flaw-in-online-movie-ratings/">https://fivethirtyeight.com/features/al-gores-new-movie-exposes-the-big-flaw-in-online-movie-ratings/</a>
Wonder how often distilling movie to mean score obscures this kind of detail?</q></li>
</ul>
<p>
Nine years later, I think those examples are about as good as the medium can get in 140 characters. It's enough to let a folding magazine's writers know that they'll be missed. It can give enough context to a link so that a reader knows it's worth clicking. It's possible to complain about a single, concrete thing (GOPATH...) in hope of soliciting links to a solution. It's enough to broadcast a position on some contemporary issue.
</p>
<p>
I don't think
<a href="http://www.politico.com/magazine/story/2017/01/why-we-cant-fix-twitter-214607">Twitter is broken</a>. I just think it's not a medium
that can ever facilitate discussion; it's a mistake to think discussion there is even possible. Nor is nuance. It's a broadcast medium for simple thoughts, links, and photos.
280 characters won't change that.
</p>
Slack Overload: A Frustrated User Rant
2017-01-22
<p>
Slack
<a href="https://slackhq.com/threaded-messaging-comes-to-slack-417ffba054bd">introduced threading</a>.
I had such high hopes for this feature. But... it misses the mark and doesn't solve the problem that
Slack creates.
</p>
<p>
First, some background:
</p>
<dl>
<dt>Slack is not optional</dt>
<dd>
I'm on a distributed team but, even when I haven't been, using Slack is a requirement.
</dd>
<dt>People say important things in Slack</dt>
<dd>
This shouldn't come as a surprise: given a communication channel, people say important things.
Since Slack is not optional, people expect others to read the messages they write. This is a pretty reasonable perspective.
</dd>
<dt>People say unimportant (to me) things in Slack</dt>
<dd>
Again, no surprise here. Assuming what they're saying is relevant to the channel, it still might not be specifically relevant to me.
In the context of a single team this is true, but it's even more applicable in a larger organization.
Even after quickly leaving low-relevance channels and muting others, I still keep tabs on more than a dozen channels.
</dd>
</dl>
<p>
Taken together, these things result in an information filtering problem.
Scattered across all these channels is a subset of messages that I do want to read, but any given message has
<strong>no context</strong>. Over and over again, I have to evaluate and dismiss messages that I could have
ignored. Context is critically missing.
</p>
<p>
For a contrasting example, every email message has a subject line. I can quickly evaluate any message and ignore those not important to me.
To manage email further, one can do things like add filters or mute discussions.
For example, my credit cards automatically send me an email for every transaction. I tag these emails and
trigger a specific notification on my phone, so I know immediately any time my cards are charged. I archive
an email immediately if it's a charge I'm expecting. That single tag provides complete context.
</p>
<p>
<em>Managing email is a solved problem</em>. Anyone who says otherwise hasn't put in the work to route and filter incoming information.
</p>
<p>
Email is, of course, not Slack. Chat requires different solutions. What's frustrating, though,
is that other tools similar to Slack have solved (or at least helped to mitigate) the filtering problems they introduce.
<a href="https://www.flowdock.com/">Flowdock</a> is what I used before Slack (and no, Flowdock isn't a Slack clone: it came first).
They built a
<a href="http://blog.flowdock.com/2014/04/30/beware-of-private-conversations/">workable solution to this problem</a>
years ago. Here's a screenshot:
</p>
<img src="/entry/resource/2017-01-22.png" class="img-rounded" alt="Flowdock threading"/>
<p>
The colored bar on the left gives enough context that the message flow can be chunked. At a glance,
I can evaluate a given thread and dismiss it. In real-world usage, often the last several messages would be a single thread.
This style of threading drastically reduced the problem of information overload; it facilitated
quickly viewing a channel and categorizing entire blocks of messages as either dismissible or something to be read.
</p>
<p>
Slack's solution fails to solve the problem. It takes messages completely out of the channel body, so it will only ever be used on a small
subset of any channel's messages. Once a team embraced Flowdock's threading, <em>every single message</em> had existing context or established
a new one. Slack's threading can't support this, because a thread hides replies behind additional clicks.
</p>
<p>
With a different implementation, Slack could drastically reduce the amount of time I'm forced to spend reading
and dismissing irrelevant information. I had such hopes that they would get threading right. Slack, please give me the context and tools
I need to filter information.
</p>
Mourning the Open API
2016-10-02
<p>
A few weeks ago, I got an email from Rotten Tomatoes letting me know that their API is going private.
I should "re-apply via the Business Proposal Form" to get continued access. This is actually the third and final major API to close
that I used to build my
<a href="/entry/show/278">master's project</a>
several years back. That software, or something like it, would be impossible to build today.
</p>
<p>
For the project, I built a collaborative recommender system for movies based on tweets.
The system had two large components. First, I built a classifier to decide whether a tweet was positive or negative. To build the
training set, I attached to Twitter's firehose and searched for tweets containing expressions like :) or :(, using them as a noisy label.
Today, the firehose requires special permission; developers can no longer just start exploring this data or building something.
</p>
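<p>
The emoticon trick is worth a sketch: strip the emoticon from the text and use it as a noisy positive/negative label. The example below uses scikit-learn for brevity rather than whatever the original project used, and the tiny corpus is invented for illustration.
</p>
<pre class="code"><code data-language="python">
# Noisy-label sentiment sketch: the emoticon supplies the label and
# is stripped from the text. Illustrative only; not the project code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

raw = [
    "loved the new movie :)",
    "that ending was great :)",
    "two hours I want back :(",
    "terrible acting, skip this one :(",
]
labels = ["pos" if ":)" in t else "neg" for t in raw]
texts = [t.replace(":)", "").replace(":(", "").strip() for t in raw]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["what a great film"]))  # likely ['pos']
</code></pre>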
<p>
Once I'd built a classifier, I needed a collection of accounts that had tweeted about several movies. Topsy was my source here. Even then, Twitter
didn't offer historical access. Topsy provided a freemium API and search that let me build a dataset
about older movies, which I needed to assemble a large enough collection of different items for recommendation. Topsy was
purchased by Apple in 2013 and shut down in 2015. Today, there is no free source of historic tweets.
</p>
<p>
To build an informational page for my recommendations (links to reviews, poster art, and other information), I used Rotten Tomatoes. This let
me put together a page for each movie without manual data entry. This API is now private.
</p>
<p>
Finally, as a new user entered the system, I read their entire timeline to find tweets about movies. This is what let me calculate similar users
for collaborative recommendation. I also read the entire timelines of users surfaced by Topsy in an effort to build a larger dataset.
Twitter's
<a href="https://blog.twitter.com/2012/changes-coming-in-version-11-of-the-twitter-api">API changes in 2012</a>
would have made this part much harder (specifically
the rate-limiting). I likely wouldn't have been able to get sufficient data in time, as I ran
data collection for weeks at the higher rate-limit to build my recommender.
</p>
<p>
Running through this list, I'm reminded of Anil Dash's
<a href="https://www.youtube.com/watch?v=9KKMnoTTHJk">The Web We Lost</a>.
He builds a fantastic parallel between
<a href="http://www.nytimes.com/2011/10/20/opinion/zuccotti-park-and-the-private-plaza-problem.html">privately owned public spaces</a>
and technology platforms. There's a lot to that topic, which is worth visiting, but it's tangential to this discussion.
More to my point, he talks extensively about the drive toward a consolidation of a diverse
ecosystem into a few massive, non-interoperable giants that view their platforms as a walled garden.
He also contrasts Flickr and Instagram. The former cares about metadata, and that
is what makes so many things possible.
</p>
<p>
I really can see a stark contrast between Flickr and Instagram. Built years apart, the former embraces concepts like
metadata, Creative Commons licensing, and an API: all the things that make it possible to pull its photos and make them a part of something else.
I even found a
<a href="https://amzn.com/0470097744">2007 book dedicated to Flickr mashups</a>.
In contrast, Instagram requires pre-approval of apps. It
<a href="http://www.theverge.com/2012/11/5/3605316/instagram-web-profiles">took years</a>
for Instagram to come to the web from mobile and
<a href="http://www.theverge.com/2015/7/20/9003521/instagram-web-search">years more</a>
before even basic things like web search were in place. Instagram's content is locked away,
reflecting the walled garden the app was born in.
</p>
<p>
I miss the perspective of "Here's access to something that's uniquely our users' via an API; go build something we can't imagine."
I hope that isn't a luxury that disappears as soon as a stock goes public or growth slows
down. Platforms need to make money, and Twitter and Rotten Tomatoes don't <em>owe</em> me, a developer, anything. But... they do
owe their users. These platforms are stewards and aggregators. Locking away this information deprives their community.
Whether it's something as silly as
<a href="http://thenextweb.com/shareables/2012/04/27/klouchebag-is-a-wake-up-call-for-people-who-care-about-their-klout-score/">Klouchebag</a>
or something more profound, like
<a href="http://www.reuters.com/article/us-chicago-twitter-food-poisoning-idUSKBN0GQ25820140826">Chicago tracking food poisoning</a>,
the web is a better place when we share.
</p>
<p>
It's sad to see all this interesting data disappearing behind walls.
</p>
Quantified Self Meets IDE: A Year of Data
2016-04-11
<p>
More than a year ago, I started tracking exactly what code I was working on using
<a href="https://wakatime.com">WakaTime</a>. As I've moved from specialist to generalist as a developer, I wanted some real data showing where I'm focusing; data I could use to drive decisions. Am I getting a picture of the full stack? Was our team too focused on operations this sprint? Am I learning what I want to be learning?
</p>
<h4>The Data</h4>
<p>
Now that I have a solid year's worth of data, some analysis is in order.
The question I want to explore today is "How does the time I spend coding break down by language?"
We do mostly
<a href="/entry/show/281">pair programming</a>, so the absolute numbers I've captured aren't accurate, but I believe the percentages are. WakaTime hooks into Sublime, Visual Studio, and the JetBrains tools, so it captures nearly all my source edits. It likely underrepresents operational work, though, as it has no hooks into the terminal.
I've only put it on my work machine (that's what I was interested in measuring), so side projects aren't in the mix. One final thing to note is that the data included 8% of my time bucketed into "other", which was a grab bag of scratch buffers in various technologies, config files, and other random things. Anyway, here's a year of data broken down by language:
</p>
<p style="text-align: center">
<img src="/entry/resource/20160410a.png" class="img-rounded" alt="Ruby: 2.3%, HTML: 2.5%, JSON: 4.2%, C#: 4.5%, Markdown: 4.9%, Bash: 5.1%, TypeScript: 9.6%, YAML: 13.1%, Go: 24.8%, JavaScript: 29.2%"/>
</p>
<p>
Several things jump out at me:
</p>
<dl>
<dt>18% infrastructure</dt>
<dd>
Though operations work in general is underrepresented, infrastructure as code isn't. All the Bash and YAML (Ansible and <a href="http://bosh.io/">Bosh</a>) fall into this bucket. So, I've spent ~18% of my time coding up our infrastructure.
</dd>
<dt>12% front-end</dt>
<dd>
The division between TypeScript and JavaScript is useful, because for historical reasons the UI is TypeScript while JavaScript represents server side Node.js code. So, toss in the HTML and that comes to ~12% front-end work.
</dd>
<dt>60% back-end</dt>
<dd>
Add up the JavaScript, Go, and C# to arrive at ~60% back-end work.
</dd>
<dt>5% documentation</dt>
<dd>
Markdown is either documentation, notes, or knowledge base articles. It's a higher percentage than I would have guessed.
</dd>
<dt>4% data</dt>
<dd>
~4% JSON is pretty interesting, in that this is time spent just looking at data. There are a few config files in the mix, but when I looked over the data, it was basically time spent studying API requests and responses.
</dd>
<dt>2% reading OSS</dt>
<dd>
Ruby is an outlier, as I don't actually write Ruby code. This fraction of my time was spent studying the underlying open source software to figure out exactly how or why it behaves a particular way. This wasn't even in an effort to change things, just to reason about the platform we're building upon.
</dd>
</dl>
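<p>
The buckets above are just sums over the per-language percentages from the graph; a quick sketch of the arithmetic:
</p>
<pre class="code"><code data-language="python">
# Rolling the per-language percentages (from the graph above) up into
# the buckets discussed in the list.
percentages = {
    "Ruby": 2.3, "HTML": 2.5, "JSON": 4.2, "C#": 4.5, "Markdown": 4.9,
    "Bash": 5.1, "TypeScript": 9.6, "YAML": 13.1, "Go": 24.8,
    "JavaScript": 29.2,
}
buckets = {
    "infrastructure": ["Bash", "YAML"],
    "front-end": ["TypeScript", "HTML"],
    "back-end": ["JavaScript", "Go", "C#"],
}
for name, langs in buckets.items():
    print(f"{name}: {sum(percentages[lang] for lang in langs):.1f}%")
# infrastructure: 18.2%  front-end: 12.1%  back-end: 58.5%
</code></pre>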
<p>
I have a buzzword aversion to the terms "full-stack" and "devops", so I'm just going to call our team a collection of generalists. But, at least for this sample size of n=1, the graph is concretely what being a generalist working within a team of other generalists looks like.
</p>
<h4>Thoughts</h4>
<p>
I've spent quite a bit of time now in both specialist and generalist roles. One thing I really like about my current model, where each team member works with all the technology, is that folks have the understanding and the mandate to solve problems anywhere. When very specialized, I've been frustrated at times by not being able to solve issues outside my area.
I actually <a href="/entry/show/233">wrote a little on this</a>
five years ago when I was much more specialized:
</p>
<blockquote>
...to gain some experience. I'm a web developer, but at our shop we're very specialized. The developer's don't deal with server maintenance much. This specialization allows groups to be more productive, but it also makes troubleshooting issues on the border between the application and the server (class loading issues, for example) more difficult to deal with. It's frustrating not having the experience to deal with these kinds of issues.
</blockquote>
<p>
My team, down to the individual, is empowered and expected to solve problems and improve all aspects of our product. That's a powerful concept.
</p>
<p>
I also think incentives are well aligned working this way. "<a href="https://queue.acm.org/detail.cfm?id=1142065">You build it, you run it</a>" means that we're on call, so we <em>do</em> spend the time to build monitoring and automation; one can see it
above in the roughly 1/5 of my coding time spent automating infrastructure.
I have been pulled from sleep in the middle of the night by my software and it really does change one's perspective. Delivering something
reliable becomes more important. There's a very real moral hazard when reliability is divorced from feature delivery.
</p>
<p>
This applies to other areas as well. We're also both building and consuming our APIs, so we strike a balance between pragmatism and hardcore HATEOAS.
When our customers are confused and generate support tickets, we have the incentive to
<a href="https://github.com/CenturyLinkCloud/PublicKB/commit/510418af833f79e3704a07cc92bcef954e800f02">write up examples</a> and improve the user experience.
There's neither handoff, nor transition, nor gaps.
</p>
<p>
There are disadvantages to generalization as well.
The 10,000 hour rule isn't actually a thing (the original source for the number in Outliers, in fact, wrote an amusingly titled rebuttal
"<a href="https://scholar.google.com/scholar?hl=en&q=The+Danger+of+Delegating+Education+to+Journalists">The Danger of Delegating Education to Journalists</a>"), but it is convenient shorthand for a lot of time invested in something. I've spent that many hours and more working on the JVM.
</p>
<p>
In contrast, looking at the graph above, I've spent my more recent time spread across many languages and even more frameworks. Despite not really touching Java or Groovy for the past year and a half, they still feel more familiar than Go or TypeScript.
Never focusing long enough to gain real expertise, I sometimes find myself googling the most fundamental bits of language syntax. What isn't visible in the graph above are the context switches. We might go multiple sprints without touching any Node.js or Go code. After these switches, I'll forget how to receive on a Go channel or TypeScript's delimiting character for multiline strings.
</p>
<p>
Another disadvantage is making decisions about technologies and frameworks in the absence of deep expertise. Choosing among upstart/monit/runit/systemd/daemonize is a recent concrete example. This is the sort of choice someone with years of Linux system administration experience would have a very informed opinion about.
</p>
<p>
Sadly, I don't have a satisfying conclusion to this post. Some days, I miss that deep expertise and language facility that comes from working with the same language and framework for months on end: to know a thing fully. On the other hand, I really do love that I can build, maintain, monitor, update, and deploy full solutions with confidence.
</p>
SaaS and the Psychology of Ownership
2015-09-07
<p>
Recently,
<a href="http://blog.jetbrains.com/blog/2015/09/03/introducing-jetbrains-toolbox/">JetBrains announced</a>
they're changing their licensing to a SaaS style subscription. The reaction was... less than positive.
</p>
<p>
I was curious how negative the response actually was. Paging through comments, it did seem overwhelming, but I wanted something less subjective. I took four posts with a large number of comments and fed them through <a href="http://text-processing.com">text-processing.com</a>'s sentiment analysis API. The posts:
</p>
<ul>
<li>
<a href="http://blog.jetbrains.com/blog/2015/09/03/introducing-jetbrains-toolbox/">JetBrains post announcing Toolbox (640 comments)</a>
</li>
<li>
<a href="https://news.ycombinator.com/item?id=10170089">Hacker News post (454 comments)</a>
</li>
<li>
<a href="https://www.reddit.com/r/programming/comments/3ji148/jetbrains_toolbox_monthly_yearly_subscription_for/">Reddit post (200 comments)</a>
</li>
<li>
<a href="http://blog.jetbrains.com/blog/2015/09/04/we-are-listening/">JetBrains "We are listening" followup post (355 comments)</a>
</li>
</ul>
<p>
The code for analysis and raw data are all in <a href="https://github.com/cholick/blog-post-2015-09">this github repository</a>.
</p>
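<p>
The repository has the real code; the heart of it is a loop like the following sketch. The endpoint and response fields here are my recollection of text-processing.com's API, so treat them as assumptions rather than a definitive reference.
</p>
<pre class="code"><code data-language="python">
# Sketch of classifying comments with text-processing.com's sentiment
# API; the URL and the 'label' response field are assumptions here.
from collections import Counter

import requests

API = "http://text-processing.com/api/sentiment/"
comments = [
    "I love IntelliJ, but this subscription model worries me.",
    "Great news, the new licensing is much simpler.",
]

counts = Counter()
for comment in comments:
    response = requests.post(API, data={"text": comment})
    response.raise_for_status()
    counts[response.json()["label"]] += 1  # 'pos', 'neg', or 'neutral'

print(counts)
</code></pre>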
<p>
Analysis Results:
</p>
<table class="table table-bordered table-striped" style="width: auto">
<thead>
<tr>
<th>Source</th>
<th>Positive</th>
<th>Negative</th>
<th>Neutral</th>
</tr>
</thead>
<tbody>
<tr>
<td>JetBrains Post 1 (Announcement)</td>
<td>183</td>
<td>338</td>
<td>119</td>
</tr>
<tr>
<td>Hacker News</td>
<td>98</td>
<td>240</td>
<td>116</td>
</tr>
<tr>
<td>Reddit</td>
<td>42</td>
<td>131</td>
<td>27</td>
</tr>
<tr>
<td>JetBrains Post 2 (Listening)</td>
<td>100</td>
<td>197</td>
<td>58</td>
</tr>
<tr>
<td>Total</td>
<td>423</td>
<td>906</td>
<td>320</td>
</tr>
</tbody>
</table>
<p>
Negative comments outnumbered positive a little more than 2:1. Spot-checking the analysis, I think positive sentiment is overrepresented due to two types of comments: JetBrains staff responding on their own blog were almost always flagged positive, and many commenters talked about how much they liked JetBrains products for the bulk of a post but then concluded that they didn't like this change. The numbers are graphed below.
</p>
<p style="text-align: center">
<img src="/entry/resource/20150907.png" alt="Comment Sentiment Graph"/>
</p>
<p>
I'm absolutely an IntelliJ apologist. In teams, if someone complains about the tool, I speak up and defend it. I'll offer alternative ways of accomplishing something,
<a href="https://youtrack.jetbrains.com/issues?q=by%3A+cholick">submit a bug report</a>, or even
<a href="https://plugins.jetbrains.com/plugin/7114">write a plugin</a>.
I've had a personal license for more than five years, which I originally purchased simply because work didn't upgrade on the first day of a major release.
All that said, I do still feel uneasy about this change.
</p>
<p>
For me, I know it's not about the money. I would pay double what they charge for their software and not think twice about the purchase. Paging through the comments, I do see at least some others in a similar situation; some commenters acknowledge it would be cheaper for them, but they still don't like it. Many other commenters are complaining about the price, but I don't believe for a moment that's the core issue: a price increase would not generate this kind of furor.
</p>
<p>
I can't answer the question as to why everyone else is upset, but I can answer the question as to why I'm uncomfortable with the change. It's not the concept of SaaS in general that I have a problem with. In the last company I worked for, that's the type of software I actually wrote. Much of the music I listen to uses a similar pricing model: I pay a recurring fee where, if my subscription lapses, I'd lose access. Thinking more broadly, I pay a recurring "subscription fee" to live in my current house and am fine with that.
</p>
<p>
I tried to come up with a category of software where I would have similar issues. The one that immediately came to mind is the operating system. An analogous situation might be if OS X were subscription-based and, if my subscription lapsed, I would still have all my files and data but would no longer be able to use the operating system (the same way I'd still have source code if my IntelliJ subscription lapsed). I would likely switch back to Linux, similar to how many users are threatening to move to Eclipse.
</p>
<p>
So, what makes the OS special? What makes me uncomfortable with a subscription model for it and for my IDE? The key common aspect, I believe, is how critical these software products are to getting work done. I watched a Cutthroat Kitchen episode the other day, and one of the hindrances they added was making the chef cook with a single arm. That's how I feel when I'm using Sublime or Eclipse. And for something so important, I'm just flat out uncomfortable not owning it. It's not about the money. It's not about software phoning home (all my games are in Steam). It's not about a recurring fee (I upgrade IntelliJ every year). It's about renting the critical tools I use to get things done.
</p>
<p>
Everybody wants to be a SaaS, but JetBrains made a mistake. They just didn't understand their customers' relationship to the software they sell. It was avoidable though. Two years ago, they made
<a href="http://blog.jetbrains.com/idea/2013/12/intellij-idea-personal-licensing-changes/">a small step in this direction</a> with the personal license change to a subscription-for-upgrades model. The strongly negative customer reaction should have predicted the current storm. They also could have surveyed customers. As a big fan and someone willing to give the company the benefit of the doubt, I still would have reacted with "Don't do this". The takeaway? As is so often the case in software development,
<a href="http://www.agilemanifesto.org/">talk to your customers</a>.
</p>
Ansible 101
2015-04-08
<p>
I like Chef. I think it's a reasonable solution to a very real set of problems. I've worked with the tool enough to know how all its pieces fit together: Chef itself,
the nodes and environments, Chef-vault, Berkshelf, test-kitchen, and other elements of the ecosystem. I'm confident that I can modify a recipe to suit my needs or spin things up from
scratch. I like their overall model, and I like that the tooling supports a test-driven flow for developing cookbooks.
</p>
<p>
Where I run into trouble with Chef is coupling its high complexity with infrequent use. Complexity by itself isn't necessarily bad: difficult
problems can require complex solutions. My trouble is rooted in the fact that I'm a developer, not an operations engineer. I deal with Chef once every month
or two. In that time, some piece of the Chef
stack has inevitably drifted. Maybe I upgraded Vagrant. Or, more likely, some Ruby gem no longer works. Or I've forgotten some important detail about Berkshelf that's
critical to getting a recipe all the way through to production. There's enough to the stack that,
<strong>without fail</strong>, I'm debugging something broken in the tool or process itself rather than the server I'm trying to provision.
</p>
<p>
I've now been on two teams where a developer, frustrated by Chef, started playing with Ansible and had nothing but praise for the tool. I finally
decided to give Ansible a shot and adapted part of my EC2 VM's Chef recipes.
</p>
<p>
I decided to write up my experience, as I didn't find any articles covering what I'd call a complete flow: touching everything from laying out a new
repository to setting up and running tests against a Vagrant virtual machine. For my sample playbook, I'm installing a few packages,
adding some configuration files, installing the HotSpot JVM from Oracle, and setting the hostname.
For the full working example, clone my <a href="https://github.com/cholick/ansible_spike">Github repository</a>.
</p>
<p>
Ansible's <a href="http://docs.ansible.com/playbooks_best_practices.html">best practices</a> had some advice on directory layout, but it didn't
break up the environments cleanly. <a href="http://www.geedew.com/setting-up-ansible-for-multiple-environment-deployments/">@geedew's post here</a> has a
layout I prefer, as it better separates the environment specific configuration.
</p>
<pre style="font-family: Menlo, monospace; line-height: 1.3">
.
├── README.md
├── environments
│ ├── dev # development environment directory
│ │ ├── group_vars # group variables for an environment
│ │ ├── host_vars # host specific variable files
│ │ │ └── site_vm.yml
│ │ └── inventory
│ └── prod # production environment directory
│ ├── group_vars
│ ├── host_vars
│ └── inventory
├── roles # each subdirectory is a role
│ ├── common
│ │ ├── files
│ │ │ └── default.el # files for the role
│ │ └── tasks
│ │ └── main.yml # tasks, a main.yml is required
│ └── java
│ ├── files
│ │ └── install_jdk.sh
│ └── tasks
│ └── main.yml
├── server.yml # the master playbook
└── test # test directory
├── Gemfile
├── Rakefile # rakefile to run serverspec
├── Vagrantfile
├── spec
│ ├── default # serverspec tests
│ │ ├── common_spec.rb
│ │ └── java_spec.rb
│ └── spec_helper.rb
└── test.sh # test runner script
</pre>
<p>
At the top level is an environments directory, where each subdirectory contains
<a href="https://docs.ansible.com/playbooks_best_practices.html#group-and-host-variables">group and host variables</a> and an inventory file. The
<a href="http://docs.ansible.com/intro_inventory.html">inventory file</a> describes the hosts to run playbooks against. Below is the dev inventory file: it
specifies the host, gives it an alias, and configures the SSH user and key pair.
</p>
<pre class="code"><code>site_vm ansible_ssh_host=192.168.33.100 ansible_ssh_user=vagrant ansible_ssh_private_key_file=~/.vagrant.d/insecure_private_key</code></pre>
<p>
My example targets a single server. To test things out, I picked something simple that varies per environment: the hostname. The file host_vars/site_vm.yml
specifies all the host-specific values for site_vm (192.168.33.100).
</p>
<pre class="code"><code class="ruby">hostname: cholick.com.dev</code></pre>
<p>
Next are the
<a href="https://docs.ansible.com/playbooks_roles.html#roles">roles</a>. Below is my common role's main.yml.
Like Chef, there are facilities for common things like installing packages, copying files, and setting the hostname. The install
packages block makes use of Ansible's
<a href="http://docs.ansible.com/playbooks_loops.html">loops</a>. I found the end result quite readable.
</p>
<pre class="code"><code>---
- name: update cache
apt: update_cache=yes cache_valid_time=3600
sudo: yes
- name: install common packages
apt: pkg={{ item }} state=present
sudo: yes
with_items:
- emacs23-nox
- htop
- copy: >
src=../files/default.el
dest=/usr/local/share/emacs/site-lisp/default.el
mode=0644 owner=root group=root
sudo: yes
- hostname: name={{ hostname }}
sudo: yes
</code></pre>
<p>
The second role installs Java. Unfortunately, Oracle's JVM isn't in the Ubuntu repositories (a licensing thing, if I remember correctly), so I scripted
this part of the install (which makes for a better
<a href="http://agiledictionary.com/209/spike/">spike</a>
anyway). I know there are
<a href="https://launchpad.net/~webupd8team/+archive/ubuntu/java">PPAs</a> that offer this, but I haven't had good luck in the past with PPAs staying current.
<a href="https://packagecloud.io/">packagecloud.io</a> could be a solution for setting up my own, but that's for another day. Here is Java's main.yml:
</p>
<pre class="code"><code>---
- name: Check java
shell: java -version || echo "undefined"
register: java_version
changed_when: False
- name: Run install script
script: ../files/install_jdk.sh
sudo: yes
when: "'Java HotSpot' not in java_version.stderr or '1.8' not in java_version.stderr"
</code></pre>
<p>
The script is slow (downloading, uncompressing, and installing Java), so I protected it with a check that only runs it if the JVM isn't already on the box. Output of the first
task feeds into the second. install_jdk.sh is available in the
<a href="https://github.com/cholick/ansible_spike">repo for this playbook</a>.
<p>
Now we come to testing. Here is where I disagree most with the Ansible authors philosophically. Their
<a href="http://docs.ansible.com/test_strategies.html">documentation says</a>:
</p>
<blockquote>
"[...] it should not be necessary to test that services are running, packages are installed, or other such things [...] so when there is an error creating
that user, it will stop the playbook run. You do not have to check up behind it."
</blockquote>
<p>
Their perspective really misses the point, along with many of the things that unit tests provide:
</p>
<ul>
<li>A role might be perfectly written, but it might not be on the right hosts (or any)</li>
<li>Variables consumed by tasks and roles might have the wrong values</li>
<li>A valid package is installed, but not the correct one</li>
<li>
Tests help to describe the intent of the code. A test that checks that emacs is installed isn't necessarily checking up on Ansible, it's
explicitly documenting that I expect the machine to have Emacs
</li>
<li>They're a chance to fail fast, before the overhead of running in staging environments</li>
<li>Refactoring: catching changes to playbooks that still run successfully but no longer do the correct thing</li>
<li>
TDD: I'm sure anyone reading this already has an opinion about TDD; mine is that
it's the Right Thing™ to do
</li>
</ul>
<p>
So, Ansible playbook test support isn't as integrated out of the box as I would have liked. When I investigated how Chef does its testing, though,
I found that <a href="http://serverspec.org/">Serverspec</a> does much of what I had attributed to
<a href="https://github.com/test-kitchen/test-kitchen">Test Kitchen</a>. Serverspec was also quite simple to set up. After installing the gem,
running "serverspec-init" asks a series of questions that generates a test harness.
</p>
<p>
First in the test stack is a simple <a href="http://docs.vagrantup.com/v2/vagrantfile/">Vagrant file</a>, shown below. The file specifies an
IP address (matching the dev inventory file) as well as an Ubuntu version.
</p>
<pre class="code"><code class="ruby">VAGRANTFILE_API_VERSION = "2"
Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
config.vm.box = "ubuntu/trusty64"
config.vm.network :private_network, ip: "192.168.33.100"
end
</code></pre>
<p>
Rakefile, spec_helper.rb, and .rspec in the test tree were generated by Serverspec. The Gemfile, shown below, simply specifies
that I want to install Serverspec.
</p>
<pre class="code"><code class="ruby">gem 'serverspec'</code></pre>
<p>
Below I've included a simple test runner script (based on
<a href="https://servercheck.in/blog/testing-ansible-roles-travis-ci-github">a post by servercheck.in</a>). It
ensures the VM is up and then runs the tests. Optionally, the script will also start from scratch
and check for <a href="http://docs.ansible.com/glossary.html#idempotency">idempotence</a>.
</p>
<pre class="code"><code data-language="shell">#!/bin/bash -e
# colors used by the idempotence check output
green='\033[0;32m'
red='\033[0;31m'
clear='\033[0m'
if [ "$1" == "--full" ]; then
  vagrant destroy --force
fi
vagrant up
ansible-playbook -i ../environments/dev/inventory ../server.yml
if [ "$1" == "--full" ]; then
  ansible-playbook -i ../environments/dev/inventory ../server.yml \
    | grep -qE "changed=0\s+unreachable=0" \
    && (echo -e "Idempotence test: ${green}pass${clear}" && exit 0) \
    || (echo -e "Idempotence test: ${red}fail${clear}" && exit 1)
fi
rake
</code></pre>
<p>
Finally, below are a few tests over the "common" role. They check for the existence of a package, ensure that the default emacs config
has been copied over, and verify that the hostname is correctly set per the host_vars/site_vm.yml file.
</p>
<pre class="code"><code class="ruby">require 'spec_helper'
describe package('emacs23-nox') do
it { should be_installed }
end
describe file('/usr/local/share/emacs/site-lisp/default.el') do
it { should be_file }
it { should contain /backup-by-copying/ }
end
describe command('hostname') do
its(:stdout) { should match /cholick\.com\.dev/ }
end
</code></pre>
<p>
I was quite impressed with how quickly I was able to get up and running with Ansible. This simple start to exploring Ansible, though, didn't touch two areas
that have caused me headaches while using Chef.
I didn't learn how Ansible manages community playbooks (How are they versioned? What sort of quality are they? Does the Ansible
ecosystem have something analogous to <a href="http://berkshelf.com/">Berkshelf</a>?). I also didn't learn how difficult it will be to work
on a living playbook a few months from now. I do like enough of what I saw, though, to start using Ansible in personal projects.
It's a slick tool.
</p>
Two Years of Pair Programming
2014-12-21
<p>
For the last two years, I've built software using pair programming. I recently switched jobs; during this process, I talked to quite a few colleagues and researched practices at many companies. I came to realize that, as rare as pair programming is, rarer still is the way in which we practiced it. When many developers are discussing pair programming, they mean something much less intense than what I have in mind.
</p>
<p>
My team generally paired for the entire working day during a sprint. Small amounts of code were written by solo developers (for example, when people arrived at different times in the morning or someone took a vacation day), but this was the exception. Our physical space and technology setup also supported this style of work. Each workstation drove two 27” monitors, two mice, and two keyboards. Our desks had room for a laptop on the side, which we used for tasks like email and research. At the beginning of most sprints, we switched the pairs so that each team member had the chance to work with every other member over time.
</p>
<p>
After working this way for two years, I want to reflect on the practice and share my thoughts. In part, I simply want to evangelize pair programming; I very much believe this is a great way of working as a team.
</p>
<p>
There are many perspectives on team building and team cohesion. I like
<a href="http://en.wikipedia.org/wiki/Tuckman's_stages_of_group_development">Tuckman's stages of group development</a>: Forming → Storming → Norming → Performing. Performing is a great place to be as a team: we work together without (unneeded) conflict, we're motivated, we believe in our own skills and those of our teammates, and we feel like the team as a whole is greater than the sum of its parts.
</p>
<p>
The question is: how does a team reach the performing stage? Many practices, such as
<a href="http://amzn.com/0977616649">retrospectives</a>, contribute to this growth.
But for our team, I think pairing is the biggest answer to how we successfully got there.
</p>
<p>
Performing requires that team members communicate well. The act of working together all day, every day, teaches this. Effective pairing requires continual discussion. Through sheer practice, team members learn to communicate effectively. I knew, for example, that with one team member a concept might take a whiteboard discussion, while with a different member the same thing might instead require sketching out interface signatures.
</p>
<p>
Another aspect of a performing team is understanding and appreciating each member's abilities. Here, too, pair programming excels, for the same reason. Writing code together, line by line, each developer learns very quickly the strengths and weaknesses of the other team members (as well as their own). Writing software as a cross-functional team requires many things: programming language and library knowledge, experience with protocols, algorithm knowledge, building a continuous integration pipeline, learning and writing build tooling, operating system knowledge, writing deployment scripts, and a myriad of other skills. Working directly together on each of these problems, I quickly learned my teammates' strengths.
</p>
<p>
<a href="http://www.jamesshore.com/Agile-Book/trust.html">Trust is implicit</a>
in the definition of a performing agile team. A development team is always working toward a shared goal. Pair programming, though, takes this to another level. Every day, each developer is working directly with a second person to accomplish a specific goal. Over time, this shared experience built trust much more quickly than in other contexts I've experienced.
</p>
<p>
There are other advantages to this style of working outside of team building.
<a href="https://twitter.com/cholick/status/384816643370532864">One experience</a>
that I recall clearly is teaching something to the person I was pairing with. The next week, he taught it to the developer he was pairing with. Shortly thereafter, I heard the thing I initially taught spread to a fourth team member. Knowledge spreads very quickly among team members practicing pair programming.
</p>
<p>
Pairing spreads other types of knowledge too. In any code base, there will be examples of both the wrong and the right way to accomplish something. When two developers write code together, the chances are greatly increased that at least one of them understands the proper pattern to use.
</p>
<p>
Code quality is a tricky thing to quantify, especially on a young product with a lot of churn. I believe pair programming greatly improved quality, but my evidence here is more anecdotal.
</p>
<p>
A quote by Phil Karlton comes to mind:
</p>
<blockquote>
There are only two hard things in Computer Science: cache invalidation and naming things.
</blockquote>
<p>
One common experience that jumps out at me over the last couple years of pairing is having a discussion around naming variables, methods, and classes. Over and over again, naming generated a genuine conversation. Does that method name actually convey what it does? Does that name match common patterns? Would extracting that expression to a named variable add clarity? Naming is very important to a codebase, and, in my experience, pairing helped to give it the attention it deserves.
</p>
<p>
Good tests are an attribute of quality code; going a step further,
I do believe that
<a href="http://research.microsoft.com/en-us/groups/ese/nagappan_tdd.pdf">test driven development produces better code</a>.
It can, however, be easy to slip out of the habit of writing a test first. Sometimes one can know the right thing but rationalize not doing it. It's easy to tell yourself "I'll write the test after", or, on an especially bad day, "I can see the code is working, I'll skip the test for this particular case." Pairing helps to mitigate these kinds of things. Working with another developer, it's harder to rationalize not doing the right thing.
</p>
<p>
Finally, pair programming helped me to maintain focus. While working in a more open space, distractions can abound. It can also be easy to lose mental context when an interruption inevitably comes up and then have to spend minutes reloading the right elements into working memory (working memory in the human sense). I found that pairing helped me focus and, when I did lose focus, to more quickly rebuild context. Pairing is a constant discussion, and this discussion forms a bubble that blocks out distractions. My auditory sense was actively engaged during development, and I felt much less prone to distraction. Upon losing focus, rebuilding context went more smoothly; across two developers, we much more quickly picked something back up.
</p>
<p>
Pair programming helped our team build cohesion, it taught us to trust and communicate, and the practice disseminated knowledge and improved code quality. I firmly believe in the practice.
</p>
Eliminating Development Redeploys using Gradle
2014-09-06
<p>
For service development, my team recently moved away from Grails to the
<a href="https://dropwizard.github.io/dropwizard/">Dropwizard</a>
framework. One of the things I really missed from the Grails stack, though, was auto-reloading: any changes to source files appear in the running app moments after saving, without a restart. It proved feasible to pull this functionality into Gradle builds as well.
</p>
<p>
<a href="https://github.com/spring-projects/spring-loaded">Spring Loaded</a> is the library that Grails uses under its hood. It supports reloading quite a few types of changes without restarting the JVM:
</p>
<ul>
<li>Add/modify/delete methods/fields/constructors</li>
<li>Change annotations on types/methods/fields/constructors</li>
<li>Add/remove/change values in enum types</li>
</ul>
<p>
The other piece I needed was
<a href="https://github.com/bluepapa32/gradle-watch-plugin">a watch plugin</a>:
something to trigger Gradle tasks when source files change.
</p>
<p>
For the full working example, clone my <a href="https://github.com/cholick/gradle_reloading_demo">demo Github repository</a>.
</p>
<p>
The first piece of setup is adding an additional
<a href="http://www.gradle.org/docs/current/dsl/org.gradle.api.artifacts.Configuration.html">configuration</a>.
This isolates the spring-loaded.jar (which is only needed during development) from the standard configurations such as compile:
</p>
<pre class="code"><code data-language="groovy">configurations {
agent
}
</code></pre>
<p>
The dependency block reads as follows:
</p>
<pre class="code"><code data-language="groovy">dependencies {
    compile 'org.codehaus.groovy:groovy-all:2.3.4'
    compile 'io.dropwizard:dropwizard-core:0.7.1'
    compile 'com.sun.jersey:jersey-client:1.18'
    agent "org.springframework:springloaded:${springloadedVersion}"
}
</code></pre>
<p>
The compile dependencies are the standard set one would expect in a Dropwizard project. The line starting with "agent" adds the Spring Loaded dependency to the agent configuration defined earlier. The build script uses this dependency to get the spring-loaded.jar onto the file system. <em>springloadedVersion</em> is a constant defined earlier in the build file.
</p>
<pre class="code"><code data-language="groovy">task copyAgent(type: Copy) {
from configurations.agent
into "$buildDir/agent"
}
run.mustRunAfter copyAgent
</code></pre>
<p>
The above <em>copyAgent</em> task will take the spring-loaded.jar file and copy it to the build directory for later use as a
<a href="http://www.captechconsulting.com/blog/david-tiller/not-so-secret-java-agents-part-1">javaagent</a>. <em>run</em> is also configured to follow <em>copyAgent</em> in the chain.
</p>
<pre class="code"><code data-language="groovy">buildscript {
repositories {
jcenter()
}
dependencies {
classpath 'com.bluepapa32:gradle-watch-plugin:0.1.3'
}
}
apply plugin: 'watch'
watch {
groovy {
files files('src/main/groovy')
tasks 'compileGroovy'
}
}
task watchThread() << {
Thread.start {
project.tasks.watch.execute()
}
}
run.mustRunAfter watchThread
</code></pre>
<p>
The above script block adds and configures watch. The <em>buildscript</em> block adds the proper repository and
<a href="https://github.com/bluepapa32/gradle-watch-plugin">the watch plugin</a> as a dependency.
The <em>watch</em> block configures the plugin; whenever there are changes in <em>src/main/groovy</em>, the Groovy source will be recompiled. The <em>watchThread</em> task executes watch in parallel. This is needed because the final job executes two tasks that both run continuously, watch and run, and <em>watch</em> would normally block <em>run</em>. Finally, the <em>run</em> task is configured to follow <em>watchThread</em> when both are part of the chain.
</p>
<pre class="code"><code data-language="groovy">run {
args = ['server', 'app.yaml']
jvmArgs = ["-javaagent:${new File("$buildDir/agent/springloaded-${springloadedVersion}.jar").absolutePath}", '-noverify']
}
task reloading(dependsOn: [watchThread, copyAgent, run])
</code></pre>
<p>
This final bit of code configures the run command with a <em>javaagent</em> flag, which tells the JVM to attach Spring Loaded and let it do its magic. Spring Loaded also needs the <em>noverify</em> flag. The <em>reloading</em> task is the actual task to run during development: it strings together the tasks that copy the agent, spin up the thread watching for source changes, and run Dropwizard's main method.
</p>
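<p>
With everything wired up, a development session starts with a single command (assuming Gradle is on the path and the command is run from the project root):
</p>
<pre class="code"><code data-language="shell">gradle reloading
</code></pre>
<p>
Gradle copies the agent, starts the watch thread, and launches the server. From then on, saving a file under <em>src/main/groovy</em> triggers <em>compileGroovy</em>, and Spring Loaded swaps the freshly compiled classes into the running JVM.
</p>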
<p>
This configuration structure would also support frameworks outside of Dropwizard: anything with a main method, really (a sketch follows below). Though Spring Loaded can't handle every kind of code change, it eliminates a great many application restarts during development.
</p>
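<p>
A minimal sketch of that adaptation, assuming the application plugin and a placeholder entry point (<em>com.example.Main</em> is illustrative, not part of the demo repository):
</p>
<pre class="code"><code data-language="groovy">// Hypothetical: the same agent and watch wiring works for any app with
// a main method; only the entry point and program arguments change.
apply plugin: 'application'
mainClassName = 'com.example.Main'

run {
    // Identical agent setup to the Dropwizard example above.
    jvmArgs = ["-javaagent:${new File("$buildDir/agent/springloaded-${springloadedVersion}.jar").absolutePath}", '-noverify']
}
</code></pre>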
Runtime Class Loading to Support a Changing API
tag:cholick.com,2016:2792014-09-06T18:28:32Z
<p>
I maintain an <a href="http://plugins.jetbrains.com/plugin?pluginId=7114">IntelliJ plugin</a>
that improves the experience of writing
<a href="https://code.google.com/p/spock/">Spock specifications</a>.
A challenge of this project is supporting multiple & incompatible IntelliJ API versions in a single codebase.
The solution is simple in retrospect (it's an example of the
<a href="http://en.wikipedia.org/wiki/Adapter_pattern">adapter pattern</a>
in the wild), but it originally took a bit of thought and example hunting.
I was in the code again today to
<a href="https://github.com/cholick/idea-spock-enhancements/issues/19">fix support for a new version</a>,
and I decided to document how I originally solved the problem.
</p>
<p>
The fundamental issue is that my compiled code could be loaded in a JVM runtime environment with any of several different API versions present. My solution was to break up the project into four parts:
</p>
<ul>
<li>
A main project that doesn't depend on any varying API calls and is therefore compatible across all API versions. The main project also has the code that loads the appropriate adapter implementation based on the runtime environment it finds itself in. In this case, I'm able to take advantage of the IntelliJ PicoContainer for service lookup, but the reflection API or dependency injection would also work (see the sketch after this list).
</li>
<li>
A set of abstract adapters that provide an API for the main project to use. This project also doesn't depend on any code that varies across API versions.
</li>
<li>
Sets of classes that implement the abstract adapters, one set per supported API version (two sets here, which brings the total to four parts). Each set wraps the changing API calls and is compiled against a specific API version.
</li>
</ul>
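<p>
As an illustration of the reflection alternative mentioned above (the real plugin uses PicoContainer registration, shown further down), the lookup could be as simple as choosing a class name at runtime. This is a sketch only; <em>AdapterLoader</em> is hypothetical and not part of the plugin:
</p>
<pre class="code"><code data-language="java">// Sketch only: pick the adapter implementation by name at runtime.
// The class names match the adapters below; the real plugin registers
// them with IntelliJ's PicoContainer instead of using reflection.
public final class AdapterLoader {
    private AdapterLoader() {}

    static LanguageLookup loadLanguageLookup(boolean atLeast14) throws Exception {
        String className = atLeast14
                ? "com.cholick.idea.spock.LanguageLookup14"
                : "com.cholick.idea.spock.LanguageLookup11";
        // Both adapters have a public no-arg constructor, so newInstance suffices.
        return (LanguageLookup) Class.forName(className).newInstance();
    }
}
</code></pre>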
<p>
The simplest case to deal with is a refactor where something in the API moves, which is exactly what broke the plugin this time: my main code needs the Groovy instance of com.intellij.lang.Language, and that instance moved in IntelliJ 14.
</p>
<p>
This code was constant until 14, so in this case I'm adding a new adapter. In the adapter module, I have an abstract class
<a href="https://github.com/cholick/idea-spock-enhancements/blob/master/intellij-adapter-api/src/com/cholick/idea/spock/LanguageLookup.java">LanguageLookup.java</a>:
</p>
<pre class="code"><code data-language="java">package com.cholick.idea.spock;
import com.intellij.lang.Language;
import com.intellij.openapi.components.ServiceManager;
public abstract class LanguageLookup {
public static LanguageLookup getInstance() {
return ServiceManager.getService(LanguageLookup.class);
}
public abstract Language groovy();
}
</code></pre>
<p>
The lowest IntelliJ API version that I support is 11. Looking up the Groovy language instance is constant across 11-13, so the first concrete adapter lives in the module compiled against the IntelliJ 11 API.
<a href="https://github.com/cholick/idea-spock-enhancements/blob/master/intellij-adapter-11/src/com/cholick/idea/spock/LanguageLookup11.java">LanguageLookup11.java</a>:
</p>
<pre class="code"><code data-language="java">package com.cholick.idea.spock;
import com.intellij.lang.Language;
import org.jetbrains.plugins.groovy.GroovyFileType;
public class LanguageLookup11 extends LanguageLookup {
public Language groovy() {
return GroovyFileType.GROOVY_LANGUAGE;
}
}
</code></pre>
<p>
The newest API introduced the breaking change, so a second concrete adapter lives in a module compiled against version 14 of the API.
<a href="https://github.com/cholick/idea-spock-enhancements/blob/master/intellij-adapter-14/src/com/cholick/idea/spock/LanguageLookup14.java">LanguageLookup14.java</a>:
</p>
<pre class="code"><code data-language="java">package com.cholick.idea.spock;
import com.intellij.lang.Language;
import org.jetbrains.plugins.groovy.GroovyLanguage;
public class LanguageLookup14 extends LanguageLookup {
public Language groovy() {
return GroovyLanguage.INSTANCE;
}
}
</code></pre>
<p>
Finally, the main project has a class <a href="https://github.com/cholick/idea-spock-enhancements/blob/master/src/main/com/cholick/idea/spock/adapter/SpockPluginLoader.java">SpockPluginLoader.java</a>
that registers the proper adapter class based on the runtime API that's loaded (I omitted several methods not specifically relevant to the example):
</p>
<pre class="code"><code data-language="java">package com.cholick.idea.spock.adapter;
import com.cholick.idea.spock.LanguageLookup;
import com.cholick.idea.spock.LanguageLookup11;
import com.cholick.idea.spock.LanguageLookup14;
import com.intellij.openapi.application.ApplicationInfo;
import com.intellij.openapi.components.ApplicationComponent;
import com.intellij.openapi.components.impl.ComponentManagerImpl;
import org.jetbrains.annotations.NotNull;
import org.picocontainer.MutablePicoContainer;
public class SpockPluginLoader implements ApplicationComponent {
private ComponentManagerImpl componentManager;
SpockPluginLoader(@NotNull ComponentManagerImpl componentManager) {
this.componentManager = componentManager;
}
@Override
public void initComponent() {
MutablePicoContainer picoContainer = componentManager.getPicoContainer();
registerLanguageLookup(picoContainer);
}
private void registerLanguageLookup(MutablePicoContainer picoContainer) {
if(isAtLeast14()) {
picoContainer.registerComponentInstance(LanguageLookup.class.getName(), new LanguageLookup14());
} else {
picoContainer.registerComponentInstance(LanguageLookup.class.getName(), new LanguageLookup11());
}
}
private IntelliJVersion getVersion() {
int version = ApplicationInfo.getInstance().getBuild().getBaselineVersion();
if (version >= 138) {
return IntelliJVersion.V14;
} else if (version >= 130) {
return IntelliJVersion.V13;
} else if (version >= 120) {
return IntelliJVersion.V12;
}
return IntelliJVersion.V11;
}
private boolean isAtLeast14() {
return getVersion().compareTo(IntelliJVersion.V14) >= 0;
}
enum IntelliJVersion {
V11, V12, V13, V14
}
}
</code></pre>
<p>
Finally, in code where I need the Groovy com.intellij.lang.Language, I get hold of the LanguageLookup service and call its groovy method:
</p>
<pre class="code"><code data-language="java">...
Language groovy = LanguageLookup.getInstance().groovy();
if (PsiUtilBase.getLanguageAtOffset(file, offset).isKindOf(groovy)) {
...
</code></pre>
<p>
This solution allows the same compiled plugin JAR to support IntelliJ's varying API across versions 11-14. I imagine that Android developers commonly implement solutions like this, but it's something I'd never had to write as a web application developer.
</p>