SF Engineering Leadership Annual 2022

I had the good fortune of attending the San Francisco Engineering Leadership Community's 2022 Annual conference, their first since the start of the COVID-19 pandemic. The event took place over two days at Fort Mason in San Francisco, with about a dozen vendor booths, three “keynote” spaces, and a handful of tables for smaller group conversations. In this blog post I’ll summarize some of my observations and learnings from the two days:

Management Tips

Build & Staff Teams by Business Priority

Most companies naturally devote time and energy to teams in proportion to their size. If you have six engineers divided into two teams of three, it’s natural for each team to get roughly similar energy and attention across the business, even if one of the teams is working on a problem that is meaningfully higher impact to the business. Be careful to design teams and team sizes to match business value, or to explicitly devote more energy to the higher impact groups.

Dev Experience is a Major Trend

Most successful tech organizations (Netflix, Google, Facebook, Slack etc.) spend at least 10% of all tech resources on developer experience and tools.

High Performing Teams

The Story Points fad is dead — every leader I spoke to or polled, about 20 in all, agreed that measuring and optimizing for velocity via story points (or similar) is not productive.

One way to identify the high(est) performing teams: the 360 Team NPS score. Survey other teams throughout the organization, having them fill out an NPS survey for each of your engineering teams. Then ask each team to do an internal NPS survey (commonly called an eNPS). If both the external and internal NPS scores come back good, meaning that other teams perceive the team as high performing and the team itself is happy, then you’ve probably got a high performing team.
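For concreteness, the arithmetic behind an NPS score (the standard 0-to-10 “would you recommend” survey) can be sketched as below. The sample scores and the “good” threshold are invented for illustration; what counts as a good NPS varies by organization.

```python
def nps(scores):
    """Standard NPS: % promoters (9-10) minus % detractors (0-6), from -100 to 100."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return round(100 * (promoters - detractors) / len(scores))

def looks_high_performing(external_scores, internal_scores, threshold=30):
    # Illustrative cutoff: both the 360 view and the eNPS must clear it.
    return nps(external_scores) >= threshold and nps(internal_scores) >= threshold

external = [9, 10, 8, 9, 7, 10]  # other teams rating this team (the 360 view)
internal = [9, 9, 10, 8]         # the team rating itself (eNPS)
print(nps(external), nps(internal))  # 67 75
```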

Team of Teams by Stanley McChrystal was referenced a few times; great book. In general there was a lot of focus on the impact of leaders empowering their teams by creating high stakes and strong mission alignment.

Remote Work

Remote is hardest on junior employees. It’s hard for them to get interrupt-driven help and mentorship. Common solutions include hybrid work (teams get together in person at least once a week), hiring fewer juniors, and ramping juniors up in person before going distributed/remote.

Internal Q&A sites sound good on paper but don’t take off. Every manager I spoke to who had tried Stack Overflow for Teams (or Gleen, or Threads) said it failed to catch on.

Slack conversation is ephemeral content. Identifying and porting knowledge from Slack to a wiki is a lossy process; nobody identified a robust or reliable process beyond “keep encouraging everyone to use the wiki.”

In hybrid organizations, all in-person employees should join remote meetings independently from their own devices. This was widely agreed upon as the only way to ensure a productive hybrid meeting.

Donuts (randomized 1:1 pairings) have been moderately successful at other remote teams. They’re better than nothing, but not a silver bullet for creating serendipity and social connection.

I asked a focus group of 15 other engineering leaders what kind of team they think they’ll be managing in 10-15 years: 100% of hands went up for remote, 0% for in person.

An idea for minimizing “unread Slack channel anxiety”: designate some Slack channels as special, required reading, and have people star those channels during onboarding. Then post infrequent but important updates there. Everything else should be assumed to be ephemeral.

DORA Metrics

DORA metrics are four metrics designed to measure the speed and quality of an engineering team: deployment frequency, lead time for changes, change failure rate, and time to restore service. Sleuth, a company that measures DORA metrics, was at the conference and happy to espouse the benefits of continuous deployment.

Interestingly, at an earlier round-table that day discussing high performing teams, I asked how many other managers were tracking these metrics; of about 15 people, none indicated they were familiar with DORA metrics. Everyone knew what continuous deployment was, though, and the universal sentiment was that it is A Good Thing to strive for.
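For the curious, here is a rough sketch of how the four DORA metrics might be computed from a deploy log. The records and field layout are invented for illustration; real tools like Sleuth derive these from your VCS and deploy pipeline.

```python
from datetime import datetime, timedelta
from statistics import median

# (merged_at, deployed_at, caused_incident) - invented sample data
deploys = [
    (datetime(2022, 6, 1, 9),  datetime(2022, 6, 1, 11), False),
    (datetime(2022, 6, 2, 14), datetime(2022, 6, 2, 15), True),
    (datetime(2022, 6, 3, 10), datetime(2022, 6, 3, 16), False),
    (datetime(2022, 6, 6, 9),  datetime(2022, 6, 6, 10), False),
]
restore_times = [timedelta(minutes=45)]  # recovery time for the one incident

days_observed = 7
deployment_frequency = len(deploys) / days_observed                  # deploys per day
lead_time = median(d - m for m, d, _ in deploys)                     # merge -> production
change_failure_rate = sum(1 for *_, bad in deploys if bad) / len(deploys)
time_to_restore = sum(restore_times, timedelta()) / len(restore_times)
```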

Memorable Quotes

“You can’t a/b test organizational changes or management decisions, especially in a growing organization”

“Don’t look back, you’re not going that way” — thinking about careers

“Tools dictate your process, your process informs your culture, your culture guides tool choice.”

Misc. Conference Tips

Small Groups are Key

In general I found the larger talks to be, on the whole, lower value than the small group conversations. That’s not to say they were of no value; it was nice to hear directly from folks who have done very respectable things talk about their journeys and get a sense of who they are as people. Lindsey Simon, VP of Engineering at Vercel, in particular has a great sense of humor and was a very engaging speaker. The small groups, though, were considerably more thought provoking, and are where I spent most of my time.

Being Curious is Key

On two occasions I went out of my way to be empathetic or curious with vendors at the conference. The first involved an engineer asking hard-hitting questions of a sales rep for some SaaS software. They were fair questions, but it was clear to me that the salesperson was out of their technical depth and unable to produce a satisfying answer. After listening for a minute or so, I took an educated guess at what the questioner might be looking for, throwing the salesperson a bone and letting them off the hook for that line of questioning. Needless to say he was very thankful, and we had an extended and honest conversation afterward, both about his product and about the world of selling SaaS.

Not long after, I met the founder of Metaview.ai. Curious, customer-focused founder that he is, he started asking me about my company and interview process. I gave him the rundown, then got very curious about his business: what motivated him to solve this problem, how he thinks about interviews, how to provide fair and consistent interview experiences, his philosophy on training teams to hire, etc. We got into it for a good few minutes, and I must have made a good impression, as he forwarded me an invitation to a dinner his company was curating that evening. I gratefully accepted, and that dinner turned out to be one of the highlights of the trip for me!

Fun / Random Knowledge

All the sounds in Slack come from the company’s first product, a video game they later pivoted away from

Vercel is pronounced ver-sell not versil

Photos

Jon Hansley, CEO of Emerge, a product consultancy in Oregon, discussing Alignment
Free book #1
Free book #2

 

Setting up budgets for cloud usage with Terraform

From time to time your team may want to use a new service from your cloud provider. That request may come with an estimated usage cost, and if it fits the budget and seems like good ROI, it gets approved. For most startup projects, that’s where cloud cost control ends. With just a bit of extra effort, especially if resources are already being provisioned with Terraform, you can use the budgeting tools offered by Amazon, Google, etc. to ensure the actual cost aligns with expectations.

For the purposes of this example, I’ll use Google Cloud Budgets, but the analogous resources and APIs exist in AWS and Azure.

Goal: Add a budget to monitor the cost of a new Google Cloud Run service your team wants to deploy. 

Prerequisites: An operational knowledge of Terraform and editor access to a Google Cloud Project & Google Cloud Billing Account

Part 0 – Become familiar with your cloud provider’s budgeting tool

If you haven’t already, spend a few minutes creating a budget in the cloud console itself. The various parameters and options in Terraform will make a lot more sense once you have the context and perspective of how the budgeting process as a whole works. In Google Cloud, budgets are under “Budgets & alerts” in the billing section.

Part 1 – Set up the Cloud Run service

This is a sample taken directly from the Terraform resource documentation:

resource "google_cloud_run_service" "default" {
  name     = "cloudrun-srv"
  location = "us-central1"

  template {
    spec {
      containers {
        image = "us-docker.pkg.dev/cloudrun/container/hello"
      }
    }
  }

  traffic {
    percent         = 100
    latest_revision = true
  }
}

Part 2 – Find the service ID

When setting up a budget with Google Cloud you have the option to have the budget monitor cost of a specific service via a filter. The terraform resource for creating budgets with these filters requires you specify the service by the service’s ID. You can find the ID of various services in the cloud console UI as per the screenshot below.

Part 3 – Set up the budget

The Terraform code below is lightly modified from the sample code in the Google Cloud budget Terraform resource documentation:

data "google_billing_account" "account" {
  billing_account = "000000-0000000-0000000-000000"
}

data "google_project" "project" {
}

resource "google_billing_budget" "budget" {
  billing_account = data.google_billing_account.account.id
  display_name = "Project X Cloud Run Billing Budget"

  budget_filter {
    projects = ["projects/${data.google_project.project.number}"]
    credit_types_treatment = "EXCLUDE_ALL_CREDITS"
    services = ["services/152E-C115-5142"] # Cloud Run
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units = "100" # $100 per month
    }
  }

  threshold_rules {
    threshold_percent = 0.5 # Alert at $50
  }
  threshold_rules {
    threshold_percent = 0.9 # Alert when forecast to hit $90
    spend_basis = "FORECASTED_SPEND"
  }
}

You can even set up custom alerting rules so the teams that create new infrastructure are the ones notified if/when spend exceeds the amount forecast during planning and development.
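As a sketch of what that might look like (the team name and email address below are placeholders; `all_updates_rule` is the relevant block on the budget resource):

```hcl
# Hypothetical notification channel for the team that owns the new service
resource "google_monitoring_notification_channel" "infra_team" {
  display_name = "Project X infra team"
  type         = "email"
  labels = {
    email_address = "infra-team@example.com" # placeholder
  }
}

# Added inside the google_billing_budget.budget resource from Part 3:
#   all_updates_rule {
#     monitoring_notification_channels = [
#       google_monitoring_notification_channel.infra_team.id,
#     ]
#     disable_default_iam_recipients = true # notify only the channels above
#   }
```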

Production SQL Server Checklist & Best Practices

Much ink has been spilled on which database you should use, or how to think about which database to use, for your project. My aim in this post is not to sell you on any database paradigm, but rather to serve as a reference guide and checklist for how to responsibly host a SQL server, be it MySQL, PostgreSQL, or other, in production.

Before calling your SQL database plan ready-for-implementation, ask yourself if you’ve thought about all these requirements:

☐ Read only replicas
☐ Multi-zone hosting
☐ Automated daily backups
☐ One click rollback / backup restore
☐ Event based audit logging / full table histories / log replication
☐ Automatic disk expansion
☐ High quality Migration tooling
☐ Connection/IP Security
☐ Local dev versions / easy ability to download prod data
☐ Staging / recent-prod replica environments
☐ CPU & Memory monitoring / auto scaling
☐ Cost monitoring
☐ Slow query monitoring
☐ High quality ORM / DB Connection library

In practice it’s very expensive or impossible to do all of these things yourself; your best bet is to choose a solution that comes with many of these features out of the box, such as Google Cloud SQL or Amazon RDS. Just make sure to enable the features you care about.

—————-

Read only replicas

More often than not, a production SQL server will have use cases that divide cleanly between read-heavy and write-heavy. Perhaps the most common is the desire to do analytics processing on transactional data. Generally this should be handled with a proper data pipeline or enterprise data warehouse, but having a real-time read-only mirror is a good practice regardless, even just for your ELT tools.

Multi-zone hosting

If AWS us-east-1 goes down, will your application survive? Have a plan to ensure data is replicated in real time between availability zones, or better yet, between regions or datacenters entirely.

Automated daily backups

Ideally you have at least daily, if not more frequent, full backups that are sent off-site. Depending on your requirements, perhaps that’s an exported zip file in a storage bucket with the same cloud provider, or perhaps it’s a bucket in an entirely different cloud. Make sure everything about this process is secure and locked down tight; these are entire copies of your database, after all.

This is a good use case for that realtime read only replica.

One click rollback / backup restore

Most cloud-hosted SQL options offer one-click point-in-time restore. At a minimum, ensure you have an entirely automated way, tested regularly, to restore from one of your hourly or daily backups.

Event based audit logging / full table histories / log replication

Different databases have different terminology for this: in PostgreSQL it’s replication slots, in MSSQL it’s log replication. The idea is that you want CDC (change data capture): every mutation to every table recorded in a data warehouse for your analytics team to use as they need. Such data can be used to produce business audit logs, or to run point-in-time analytics queries that answer questions like “what was my inventory like last week?”
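As a toy illustration of why this is useful (the data below is invented, not a real CDC pipeline): once every mutation is captured as an event, answering a point-in-time question becomes a simple replay of the change log.

```python
# Each row change captured as an event: (timestamp, sku, new_quantity)
events = [
    (1, "widget", 10),
    (5, "widget", 7),
    (9, "widget", 3),
]

def quantity_at(sku, as_of):
    """Replay the change log to reconstruct a value at any point in time."""
    state = None
    for ts, s, qty in events:
        if s == sku and ts <= as_of:
            state = qty
    return state

print(quantity_at("widget", 6))  # 7: the inventory "last week"
```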

Automatic disk expansion

Nobody likes getting an alarm at 3AM that their database has hit its disk storage limit. In case it’s not obvious, very bad things happen when a database runs out of disk space. Make sure your SQL solution never runs out of disk by using a platform or tool that expands storage automatically. Ideally it shrinks automatically too.

High quality Migration tooling

Schema and data migrations are hard; don’t try to solve these problems yourself. Use a tool or framework that helps you generate migrations and manage their execution across environments. Remember that your migration has to work locally both for developers who have used the repository before and for brand-new developers, as well as in all staging, feature branch, and production environments. Don’t underestimate the difficulty of this challenge.

Connection/IP Security

Often you can get away with IP-allowlisting access to a database, but in 2022 that’s going out of style (and will be flagged by PCI or SOC2 auditors). Nowadays your database should be in a private VPC with no internet access, networked/peered with your application servers. Keep in mind that this will make access for developers challenging; that’s a good thing! It’s wise to have a strategy for emergencies, though, either with a proxy or a bastion host.

Local dev versions / easy ability to download prod data

You’ll want tooling to download a copy of sanitized production data for testing. Something that runs well on a local machine with 1,000 rows may be unacceptably slow in production with 2 million records. Those 2 million records may cause trouble not just due to volume, but also data heterogeneity: real-world users will hit edge cases your developers may not.

CPU, Memory, Connection monitoring / auto scaling

Ensure you have monitoring, and ideally autoscaling, on CPU, memory, and connection counts for your SQL database. It should be somebody’s job to check from time to time that these values are within acceptable ranges for your use case.

Cost Monitoring

SQL databases are generally among the more expensive parts of the stack. I recommend setting up a budget using your cloud provider’s tools so you know how much you’re spending and can monitor growth.

Slow query monitoring

It’s easy to shoot yourself in the foot with SQL, whether using an ORM or writing raw SQL, and generate very expensive, slow queries. You’ll want logging, and ideally alerting, for anything abnormally slow that makes it to production.
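In PostgreSQL, for example, a minimal version of this is built in; the 250 ms threshold below is illustrative, tune it to your workload:

```sql
-- Log every statement that runs longer than 250 ms
ALTER SYSTEM SET log_min_duration_statement = '250ms';
SELECT pg_reload_conf();
```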

High quality ORM / DB Connection library

Don’t forget about developer experience! Do you want to write raw SQL or use an ORM/DAL? There are tradeoffs in both directions; think through your options carefully. Does the ORM come with a migration tool? Does it have built-in connection pooling?

A code-free IaC link shortener using Kutt and GKE

Goal: Deploy a link shortener on your own domain without writing any (non-infrastructure) code.

Prerequisites: An operational Kubernetes cluster, knowledge of Kubernetes & Terraform, basic knowledge of Google Cloud Platform

At every company I’ve worked at in the past decade, we’ve had some mechanism to create memorable links to commonly used documents. Internally at Google, at least when I was there around 2010, they used the internally resolving name “go”; e.g. “go/payroll” or “go/chromedashboard” would point to the internal payroll or project dashboards. I suspect an ex-Googler liked the idea enough to make it a business, as GoLinks is a real thing you can pay for. Below I’ll walk through how to set up Kutt (an open source link shortener) with Terraform in your own Kubernetes cluster in Google Cloud.

Kutt has several dependencies, so let’s make sure we’ve got those in order:

  • You need a domain name and ability to set DNS records.  For example go.mycompany.com
  • You’ll need an SMTP server for authentication emails from the link shortener; we’ll use this just for the admin user.  Have your mail_host, mail_port, mail_user, and mail_password at hand.
  • Optionally: A google analytics ID
  • A Redis Instance (we’ll deploy one with terraform)
  • A Postgresql database (we’ll deploy one with terraform)

For starters, let’s set up our variables.tf file. There are quite a few values here, and more configuration options can be passed into Kutt via env vars down the road.

variables.tf

variable "k8s_host" {
  description = "IP of your K8S API server"
}

variable "cluster_ca_certificate" {
  description = "K8S cluster certificate"
}

variable "region" {
  description = "Region of resources"
}

variable "project_id" {
  description = "Google Cloud project ID"
}

variable "google_service_account" {
  description = "JSON service account to talk to GCP"
}

variable "namespace" {
  description = "Kubernetes namespace to deploy to"
}

variable "vpc_id" {
  description = "VPC to put the database in"
}

variable "domain" {
  default = "go.mycompany.com"
}

variable "jwt_secret" {
  default = "CHANGE-ME-TO-SOMETHING-UNIQUE"
}

variable "smtp_host" {}

variable "smtp_port" {
  default = 587
}

variable "smtp_user" {}

variable "smtp_password" {}

variable "admin_emails" {}

variable "mail_from" {
  default = "linkshortener@mycompany.com"
}

variable "google_analytics_id" {}

Now let’s set up our database. You can really do this any way you like, but since we’re using Google Kubernetes Engine, we likely also have access to Google Cloud SQL, so this is fairly straightforward.

database.tf

resource "google_sql_database_instance" "linkshortenerdb" {
  name             = replace("linkshortener-${var.namespace}", "_", "-")
  database_version = "POSTGRES_13"
  region           = var.region
  project          = var.project_id
  lifecycle {
    prevent_destroy = true
  }

  settings {
    # $7 per month
    tier = "db-f1-micro"
    backup_configuration {
      enabled                        = true
      location                       = "us"
      point_in_time_recovery_enabled = false
      backup_retention_settings {
        retained_backups = 30
      }
    }

    ip_configuration {
      ipv4_enabled = false
      # In order for private networks to work the GCP Service Network API has to be enabled
      private_network = var.vpc_id
      require_ssl     = false
    }
  }
}

resource "google_sql_database" "linkshortener" {
  name     = "${var.namespace}-linkshortener"
  instance = google_sql_database_instance.linkshortenerdb.name
  project  = var.project_id
}

resource "random_password" "psql_password" {
  length  = 16
  special = true
}

resource "google_sql_user" "linkshorteneruser" {
  project  = var.project_id
  name     = "linkshortener"
  instance = google_sql_database_instance.linkshortenerdb.name
  password = random_password.psql_password.result
}

That’s it (finally) for the prerequisites. Now the fun part: setting up Kutt itself (with a Redis sidecar).

kutt.tf

provider "google" {
 project     = var.project_id
 region      = var.region
 credentials = var.google_service_account
}

provider "google-beta" {
 project     = var.project_id
 region      = var.region
 credentials = var.google_service_account
}

data "google_client_config" "default" {}

provider "kubernetes" {
 host                   = "https://${var.k8s_host}"
 cluster_ca_certificate = var.cluster_ca_certificate
 token                  = data.google_client_config.default.access_token
}

resource "random_password" "redis_authstring" {
 length  = 16
 special = false
}

resource "kubernetes_deployment" "linkshortener" {
 metadata {
   name = "linkshortener"
   labels = {
     app = "linkshortener"
   }
   namespace = var.namespace
 }

 wait_for_rollout = false

 spec {
   replicas = 1
   selector {
     match_labels = {
       app = "linkshortener"
     }
   }

   template {
     metadata {
       labels = {
         app = "linkshortener"
       }
     }

     spec {
       container {
         image = "bitnami/redis:latest"
         name  = "redis"

         port {
           container_port = 3000
         }

         env {
           name  = "REDIS_PASSWORD"
           value = random_password.redis_authstring.result
         }

         env {
           name  = "REDIS_PORT_NUMBER"
           value = 3000
         }
       }

       container {
         image = "kutt/kutt"
         name  = "linkshortener"
         port {
           container_port = 80
         }

         env {
           name  = "PORT"
           value = "80"
         }

         env {
           name  = "DEFAULT_DOMAIN"
           value = var.domain
         }

         env {
           name  = "DB_HOST"
           value = google_sql_database_instance.linkshortenerdb.ip_address.0.ip_address
         }

         env {
           name  = "DB_PORT"
           value = "5432"
         }

         env {
           name  = "DB_USER"
           value = google_sql_user.linkshorteneruser.name
         }

         env {
           name  = "DB_PASSWORD"
           value = google_sql_user.linkshorteneruser.password
         }

         env {
           name  = "DB_NAME"
           value = google_sql_database.linkshortener.name
         }

         env {
           name  = "DB_SSL"
           value = "false"
         }

         env {
           name  = "REDIS_HOST"
           value = "localhost"
         }

         env {
           name  = "REDIS_PORT"
           value = "3000"
         }

         env {
           name  = "REDIS_PASSWORD"
           value = random_password.redis_authstring.result
         }

         env {
           name  = "JWT_SECRET"
           value = var.jwt_secret
         }

         env {
           name  = "ADMIN_EMAILS"
           value = var.admin_emails
         }

         env {
           name  = "SITE_NAME"
           value = "MyCompany Links"
         }

         env {
           name  = "MAIL_HOST"
           value = var.smtp_host
         }

         env {
           name  = "MAIL_PORT"
           value = var.smtp_port
         }

         env {
           name  = "MAIL_USER"
           value = var.smtp_user
         }

         env {
           name  = "MAIL_FROM"
           value = var.mail_from
         }

         env {
           name  = "DISALLOW_REGISTRATION"
           value = "true"
         }

         env {
           name  = "DISALLOW_ANONYMOUS_LINKS"
           value = "true"
         }

         env {
           name  = "GOOGLE_ANALYTICS"
           value = var.google_analytics_id
         }

         env {
           name  = "MAIL_PASSWORD"
           value = var.smtp_password
         }
         readiness_probe {
           http_get {
             path   = "/api/v2/health"
             port   = 80
             scheme = "HTTP"
           }
           timeout_seconds = 5
           period_seconds  = 10
         }
         resources {
           requests = {
             cpu    = "100m"
             memory = "200M"
           }
         }
       }
     }
   }
 }
}

With the above you should now have a Postgres server, a Redis instance, and a Kutt deployment deployed and talking to each other. All that’s left is to expose your deployment as a service and set up your DNS records.
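A minimal sketch of that last step, assuming the deployment above: a LoadBalancer service gives you an external IP to point a DNS A record for go.mycompany.com at. In practice you’d likely front this with an Ingress for TLS instead.

```hcl
resource "kubernetes_service" "linkshortener" {
  metadata {
    name      = "linkshortener"
    namespace = var.namespace
  }

  spec {
    selector = {
      app = "linkshortener"
    }
    port {
      port        = 80
      target_port = 80
    }
    type = "LoadBalancer"
  }
}
```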

Kaizen in infrastructure: Writing RCAs to improve system reliability and build customer trust

I shall endeavor to convince you that your company should regularly write Root Cause Analyses (RCAs), not just for yourselves but also as a tool to build trust with your customers. The subject of RCAs can be a bit dry, so allow me to motivate it with an example of how poorly approached RCAs can easily become a hot-button issue that costs you customers.

As the CTO of Lottery.com, I’m responsible for overseeing roughly 50 vendor relationships. For the vast majority of these vendors, Lottery.com (LDC) is a regular customer that doesn’t use their system much differently than any other customer, and things run smoothly. They charge a credit card every month, and we get a service that fulfills some business need. Yet for a small handful of these vendors, persistent issues keep ‘business as usual’ out of reach.

Allow me to tell you of an incident that occurred recently with a vendor; let’s call them WebCorp. WebCorp offers a service that we depend on for one of our services to be up. If WebCorp’s service is up, then our service is up. If WebCorp’s service goes down, we go down. In these situations, it’s in everyone’s interest for WebCorp to be reliable. So we codify that dependency in a contract called a service level agreement (SLA). SLAs are measured quantitatively, in terms of uptime percentage.

A side note on SLAs: a high-quality service measures its uptime in “nines,” as in: three nines is 99.9% uptime. That may seem like a lot, but over the course of a year, three nines of uptime translates to nearly 9 hours of downtime, or about 1.5 minutes of downtime per day. With WebCorp, Lottery.com has a five nines SLA, which translates to about five minutes of downtime per year.
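The arithmetic, for reference:

```python
def downtime_budget_minutes_per_year(nines):
    """Allowed downtime per year (in minutes) for an uptime SLA of N nines."""
    return 10 ** -nines * 365 * 24 * 60

for n in (3, 4, 5):
    print(f"{n} nines -> {downtime_budget_minutes_per_year(n):.2f} min/year")
# 3 nines -> 525.60 min/year (~8.8 hours, ~1.4 min/day)
# 5 nines -> 5.26 min/year
```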

OK, back to the story: Lottery.com got an alert in the early afternoon around noon PDT that our service was down. Within about five minutes we had determined, conclusively, that the cause of the downtime was a service failure at WebCorp. I emailed WebCorp’s emergency helpline and, to their credit, within a few minutes they acknowledged the issue and indicated they were looking into it. About an hour later they had resolved the issue and our service was back online. Total downtime was about 64 minutes.

When a vendor has an outage it is my standard practice, once the issue is resolved, to write in inquiring about what went wrong and whether mitigation steps to prevent future outages are in place. In this case, WebCorp’s response was:

It would appear that the cache flushed right before the system tried to restart. That flush wiped the contents of a Varnish file, which caused Varnish to restart with an error. That probably doesn’t mean much to someone on your end of things. Essentially, it was a really unusual conflict of a couple of automatic jobs happening on the server, so we’re fairly sure it’s not something you’ll be able to reproduce from your end of things, intentionally or unintentionally. Hope that clarifies a bit!

While I appreciate the effort to lift the curtain a little bit on some of the technical details, this response doesn’t actually tell me how WebCorp is going to prevent the issue from happening again. And so I asked them what they planned to do to prevent future such outages. 

WebCorp’s response:

We try our very best to prevent these things from happening. In order to be better prepared for a situation like this in the future, we’ve added extra monitoring […]. Now, our emergency support team will be immediately alerted whenever any downtime happens […].

Since the issue [your service] encountered is one that we have no record of having seen (either before or since), it might be premature to alter our Varnish caching processes at this time. If the issue proves to be reproduce-able and / or widespread, then we may indeed make an adjustment to our infrastructure to correct for it. For now, though, it appears to be an isolated incident.

While you do have a 99.999% SLA with us, it is actually for a different […] service! The SLA agreement is tied to [Service2] and not [Service1]. However, you may be pleased to hear that the uptime of [Service1] has been at 99.87% over the last month!

Again, I apologize for the downtime yesterday. I hope this answers the questions you had for me and the rest of my team. If not, please feel free to reach out again so we can continue the conversation. I’m always happy to help!

Again to WebCorp’s credit, this is an undeniably polite and professionally written response. The substance, however, did little to reassure me on a technical level.

What I read in the response, substantively, is:

We’re doing our best and will add more ‘monitoring’. In fact, our support team will now actually find out when downtime occurs. But this specific issue has never happened before, so it’s not in our interest to change business practices. Oh, and as a reminder, the 99.999% SLA we have for your service doesn’t technically apply here, and this service has been at 99.87%. Isn’t that great?

By signing and paying for a five-nines SLA, my expectation as a customer is to get as close to 99.999% uptime as possible for all services WebCorp offers. That WebCorp’s response seems to indicate they consider 99.87% a good uptime percentage dramatically reduces my trust in WebCorp’s future reliability. A far more reassuring response would indicate that they take all downtime seriously, that their team is investigating ways to improve the robustness of the system to ensure no customer experiences these outages again, and that they would reply in a few days once they understood exactly what went wrong in their procedures and how they would improve them.

In summary:

1) It is important that the vendor and customer have aligned expectations for service reliability.
2) If the vendor offers a contractual SLA, the customer’s expectation is that the vendor will make good faith best efforts to meet that SLA, and take any breaches seriously.

RCA The Right Way

By not performing and being transparent about a detailed RCA, it’s easy for a customer to lose faith in a company’s efforts to provide a highly-reliable service. The goal of the RCA is therefore twofold:

1) Document the failure and potential mitigations to improve service quality and reliability.
2) Provide a mechanism for being transparent about failures to build confidence, and trust, with customers.

A good RCA has a template roughly as follows:

Incident Start Time/Date:
Incident Received Time/Date:
Complete Incident Timeline:
Root cause(s):
Did we engage the right people at the right time?
Could we have avoided this?
Could we have resolved this incident faster?
Can we alert on this faster?
Identified issues for future prevention:

In this template are prompts for the pieces of information one needs to understand what happened, what was learned, and why it won’t happen again. There are many great examples of RCAs out there:

https://blog.github.com/2012-12-26-downtime-last-saturday/
https://medium.com/netflix-techblog/lessons-netflix-learned-from-the-aws-outage-deefe5fd0c04
http://www.information-age.com/3-lessons-learned-amazons-4-hour-outage-123464916/
https://slackhq.com/this-was-not-normal-really-230c2fd23bdc

Kaizen, the Japanese term for ‘continuous improvement’, is an ethos often cited in industry. Building technology is hard; humans are imperfect, and therefore technology often is as well. That’s expected. The only way we get past our imperfect ways is to continuously work to get better: to own up to our mistakes, learn from them, and ensure we (and our technology) don’t make the same mistake twice.

The Commandments of Good Code according to Zach

    1. Treat your code the way you want others’ code to treat you
    2. All (ok most) programming languages are simultaneously good and bad
    3. Good code is easily read and understood, in part and in whole
    4. Good code has a well thought out layout & architecture to make managing state obvious
    5. Good code doesn’t reinvent the wheel, it stands on the shoulders of giants
    6. Don’t cross the streams!

Treat your code the way you want others’ code to treat you

I’m far from the first person to write that the primary audience for your code is not the compiler/computer, but whoever next has to read the code (which could be you 6 months from now!). Any engineer can produce code that ‘works’; what distinguishes engineers who can efficiently write maintainable code that supports a business long term is an understanding of design patterns, and the experience to know how to solve problems simply, in a clear and maintainable way. The rest of these commandments are all supporting lemmas of this thesis.

All (ok most) programming languages are simultaneously good and bad

In (almost) any programming language it is possible to write good code or bad code. Ergo, if we judge a programming language by how easy it is to write good code in it (which should at least be one of the top criteria, anyway), nearly any programming language can be ‘good’ or ‘bad’ depending on how it is used (or abused).

An example of a language that many consider ‘clean’ and readable is Python. Many organizations will enforce a universal coding standard (i.e. PEP8), the language itself enforces some level of whitespace discipline, and the built-in APIs are plentiful and fairly consistent. That said, it’s possible to create unspeakable monsters. For example, one can define a class and define/redefine/undefine any and every method on that class at runtime. This naturally leads to, at best, an inconsistent API and, at worst, an impossible-to-debug monster. “But nobody does that!” one might naively think. Unfortunately that is untrue, and it doesn’t take long browsing pypi before you run into substantial (and popular!) libraries that (ab)use monkeypatching extensively as the core of their APIs. I recently used a networking library (xmppy) whose entire API changes depending on the network state of an object. Imagine calling ‘client.connect()’ and getting a MethodDoesNotExist error instead of HostNotFound or NetworkUnavailable.
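To make the hazard concrete, here’s a contrived sketch (not xmppy’s actual API; the class and method names are invented for illustration) of an object whose methods only come into existence at runtime:

```python
class Client:
    """Contrived sketch of a monkeypatched API that mutates itself at runtime."""

    def connect(self):
        # Danger: .send only exists after connect() has run, so the
        # object's public API depends on its network state.
        self.send = lambda msg: print(f"sent: {msg}")


client = Client()
try:
    client.send("hello")  # fails confusingly: the method doesn't exist yet
except AttributeError as exc:
    print(f"confusing failure: {exc}")

client.connect()
client.send("hello")  # only now does .send exist
```

The caller gets an AttributeError about a missing method, which says nothing about the actual problem (the client isn’t connected). A stable API that raises a domain-specific error would be far kinder to the reader and the debugger.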

An example of a language that many consider ‘dirty’ but that can be quite pleasant is Perl. Now, I know the previous sentence will cause a lot of controversy, so rather than fend off the pitchforks myself, I’ll refer you to Dave Cross, a proper Perl expert who very eloquently discusses this very topic.

Good code is easily read and understood, in part and in whole

Good code is easily read and understood, in part and in whole, by others as well as the author in the future (Trying to avoid the ‘Did I really write that?’ syndrome). By in part I mean that if I open up some module or function in the code I should be able to understand what it does without having to also read the entire rest of the codebase. Code that constantly references minute details that affect behavior from other (seemingly irrelevant) portions of the codebase is like reading a book where you have to reference the footnotes or an appendix at the end of every sentence. You’d never get through the first page! Some other thoughts on ‘local’ readability:

    • Well encapsulated code tends to be more readable, separating concerns at every level.
    • Names matter. Activate system 2 and put some actual thought into names; the few extra seconds will pay dividends.
    • Cleverness is the enemy. When using fancy syntax such as list comprehensions or ternary operators, be careful to use them in a way that makes your code more readable, not just shorter.
    • Consistency in style, both in terms of where you place braces and how you structure operations, improves readability greatly.
    • Separation of concerns. A given project manages an innumerable number of locally important assumptions at various points in the codebase. Expose each part of the codebase to as few of those concerns as possible. Say you had some kind of people management system where a person object may sometimes have a null last name. To somebody writing code in a page that displays person objects, that could be really awkward! And unless you maintain a handbook of ‘Awkward and non-obvious assumptions our codebase has’ (I know I don’t), your display page programmer is not going to know last names can be null and is probably going to write code with a null pointer exception in it when the last-name-being-null case shows up. Instead, handle these cases with well thought out APIs that different pieces of your codebase use to interact with each other.
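As a sketch of that last point (using a hypothetical `Person` class and `display_name` method, not any real system’s API), the awkward assumption can be centralized in one place so display code never has to know about it:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    first_name: str
    last_name: Optional[str]  # non-obvious assumption: last name may be None!

    def display_name(self) -> str:
        # Centralize the null-handling here so every display page
        # that renders a Person never sees the awkward case.
        if self.last_name is None:
            return self.first_name
        return f"{self.first_name} {self.last_name}"


# Display code stays blissfully ignorant of the null-last-name assumption:
print(Person("Cher", None).display_name())       # Cher
print(Person("Ada", "Lovelace").display_name())  # Ada Lovelace
```

The assumption still exists, but now it lives in exactly one spot instead of being rediscovered (usually via an exception in production) by every caller.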

Good code has a well thought out layout & architecture to make managing state obvious

Sublemma: State is the enemy. It is the single most complex part of any application and needs to be dealt with very intentionally. Common problems include database inconsistencies, partial UI updates where new data isn’t reflected everywhere, out of order operations, or just mind-numbingly complex code with if statements and branches everywhere, leading to difficult to read and even harder to maintain code. Putting state on a pedestal and being extremely consistent and deliberate in how state is accessed and modified dramatically simplifies your codebase. Some languages, Haskell for example, enforce this at a programmatic level. You’d be amazed how much the clarity of your codebase can improve if you have libraries of pure functions that access no external state, and then a small surface area of stateful code which references the outside pure functionality.
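A minimal sketch of this shape, assuming a toy shopping-cart domain (all names here are invented for illustration): a library of pure functions, plus a thin stateful shell that is the only place state lives:

```python
# Pure core: no external state, trivially testable in isolation.
def apply_discount(price: float, percent: float) -> float:
    return round(price * (1 - percent / 100), 2)


def cart_total(prices: list, discount_percent: float) -> float:
    return apply_discount(sum(prices), discount_percent)


# Thin stateful shell: the only code that mutates anything.
class Cart:
    def __init__(self):
        self._prices = []

    def add(self, price: float) -> None:
        self._prices.append(price)

    def total(self, discount_percent: float = 0.0) -> float:
        # All the actual logic is delegated to the pure core.
        return cart_total(self._prices, discount_percent)


cart = Cart()
cart.add(10.0)
cart.add(5.0)
print(cart.total(10))  # 13.5
```

All the interesting logic is pure and can be tested without constructing any stateful objects; the `Cart` class is a small, boring surface where the mutation is easy to audit.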

Good code doesn’t reinvent the wheel, it stands on the shoulders of giants

In 2015 we have a wealth of tools available to solve common problems — depend on them as much as possible so you can focus on solving the interesting part of your application. Think ahead of time about whether somebody has solved some portion of the problem you’re trying to solve. Is there something on the interwebs/github (with an appropriate license) that you can reuse? Yes, expect to make heavy modifications, but more often than not this is a time saver. Things you should not be reinventing in 2015 in your project (unless these ARE your project):

Databases.

I don’t care what your requirements are, it exists. Figure out which of CAP you need for your project, then choose the database with the right properties. Database doesn’t just mean relational databases (e.g. MySQL) anymore; you can choose from any one of a huge number of data storage models:

    • Key Value Stores e.g. Redis, Memcache
    • “No-SQL” e.g. MongoDB, Cassandra
    • Hosted DBs: AWS RDS / DynamoDB / AppEngine Datastore
    • Map Reduce engines: Amazon EMR / Hadoop (Hive/Pig) / Google Big Query
    • Even less traditional: Erlang’s Mnesia, iOS’s Core Data

Data abstraction layers

You should, in most circumstances, not be writing raw queries to whatever database you happen to choose. There exists a library to sit in between the DB and your application code, separating the concerns of managing concurrent database sessions and details of the schema from your main code. At the very least you should never have raw queries or SQL inline in the middle of your application code; please wrap it in a function and centralize all the functions in a file called ‘queries.py’ or something else equally obvious. A line like users = load_users() is infinitely easier to read than users = db.query(‘SELECT username, foo, bar FROM users ORDER BY id LIMIT 10’), etc. Centralization also makes it much easier to have consistent style in your queries, and limits the number of places to go to change the queries should the schema change.
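Here’s a minimal sketch of such a centralized query module, using sqlite3 and hypothetical names (`load_users`, a `users` table) purely for illustration:

```python
# queries.py (hypothetical): the ONLY file that knows SQL or the schema.
import sqlite3

_conn = sqlite3.connect(":memory:")
_conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
_conn.executemany(
    "INSERT INTO users (username) VALUES (?)", [("alice",), ("bob",)]
)


def load_users(limit: int = 10) -> list:
    """Return up to `limit` usernames, oldest accounts first."""
    rows = _conn.execute(
        "SELECT username FROM users ORDER BY id LIMIT ?", (limit,)
    )
    return [username for (username,) in rows]


# Application code reads cleanly and never touches SQL:
users = load_users()
print(users)  # ['alice', 'bob']
```

If the schema changes, only this module needs updating; the call sites keep reading like prose.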

Other

    • Between S3, EBS, HDFS, Dropbox, Google Drive, etc. you really shouldn’t spend much effort or mental energy on storage nowadays.
    • Take your pick of queueing services provider — ZeroMQ/RabbitMQ/Amazon SQS

Don’t cross the streams!

There are many good models for programming design, pub/sub, actors, MVC etc. Choose whichever you like best, and stick to it. Different kinds of logic dealing with different kinds of data should be physically isolated in the codebase (again, this separation of concerns concept and reducing cognitive load on the future-reader). The code which updates your UI should be physically distinct from the code that calculates what goes into the UI.
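A tiny sketch of that separation, with hypothetical names: the calculation layer computes what goes into the UI, and the rendering layer only formats it, so neither ever reaches into the other’s stream:

```python
# Calculation layer: pure, knows nothing about presentation.
def summarize_orders(orders: list) -> dict:
    return {"count": len(orders), "total": sum(orders)}


# UI layer: only formats, never computes.
def render_summary(summary: dict) -> str:
    return f"{summary['count']} orders, ${summary['total']:.2f} total"


print(render_summary(summarize_orders([9.99, 20.01])))  # 2 orders, $30.00 total
```

Swapping the UI (CLI, web page, email digest) now means replacing only the rendering function; the business logic is untouched, and vice versa.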

Conclusion

This is by no means an exhaustive or perfect list of Good Coding Commandments. That said, if every codebase I ever have to pick up in the future followed even half of the concepts in this list, I would have many fewer gray hairs, gain an extra 5 years at the end of my life, and the world would be a better place.

Credit to Scott Kyle (appden) for assistance reviewing the material in this post. Have you gotten Current For Mac yet?

The Product Market Fit Flow Chart (or PMFFC for short)

The goal of every startup, or any company producing a product that has some kind of ‘customer’, should early on be to find that magical ‘product market fit’ (herein PMF, because I’m lazy). According to Wikipedia:

Marc Andreessen was the first person that used the term: “Product/market fit means being in a good market with a product that can satisfy that market.”

I was recently being recruited by a company, but turned them down because of what I saw as a big business mistake. They had PMF, but they weren’t doing anything with it.

Allow me to introduce what I believe should be on a laminated pamphlet and given to every entrepreneur in the valley:

PMF

The bit that was missing in the company I was talking to was the Run With It stage. By that I mean: make your customers happy, and get as many of them as possible, as quickly as possible! This particular founder told me about the customers he already had, and about his 8-9 month roadmap to hire engineers and build product. I asked how many sales and support people he would hire in that time; he said zero.

By the roadmap he was suggesting, in 2016 they would be taking in maybe ~$10k monthly, with an engineering team of ~10, a sales team of 1 (the founder) and an annual burn of >$1MM. An alternative reality would be, by 2016, to have an engineering team of 5, a sales and support team of 3-5, be taking in $100K monthly, and be on the verge of breaking even and able to raise a kickass series A.

I’m a technical person by training, and 5 years ago I’d probably have been making the same mistake as this founder.  It’s really unnatural to think about sales, and it’s easy to think of salespeople as unskilled workers you can pick up by the dozen at a moment’s notice.  In reality, this couldn’t be further from the truth.  Sales is hard, very hard, and you need to be working on it as early as possible.  I consider it one of the most important lessons I’ve learned as an entrepreneur: never undervalue or under-prioritize your sales and distribution strategy.  Once you’ve achieved PMF, or are even within long distance sonar range of PMF, you should be thinking really hard about how to sell the bananas out of what you’re building.

Vietnam Bike Tour: Hanoi to Ho Chi Minh City (Day 9)

DAY: 09
Distance: 308KM
Origin: Da Lat
Endpoint: SAIGON!!!!!
Pho Consumed: 1 each for lunch
Today’s Author: Zach


Well, today’s the day! We estimate it’s roughly 290-300KM to Saigon, which should be doable in one day, we think. Both of our bottom ends have been pretty painful lately due to the number of hours riding, so we’ve taken to starting the morning with a dose of Advil to ease the discomfort while riding, which of course requires food.  Last time we were in this predicament we had some of our “emergency rations.”  This time, though, the hotel/guesthouse we’re staying at offered us a complimentary breakfast — they made fresh eggs, had fresh bread and lots of nicely sliced fruit. I tried Marmite for the first time — good god the stuff is awful. Absolutely disgusting. Worse than China’s “stinky tofu”.

As we were packing up, the lady at the hotel (whose English was impeccable — a definite surprise for us) asked where we were headed. I told her our goal was to make it to Ho Chi Minh City today, and she said “no no, too far, need 2 days.” Uh oh. Not a good sign. Hopefully we can prove her wrong, as neither of us was really looking forward to spending another night in a middle-of-nowhere-sketchy-motel.

We got on the road around 9:45AM. We had only ridden about 200km the day before, so we were not urgently pressed for gas and simply got started. It took less than 10 minutes to leave town, and shortly after that we were back in the wilderness, driving down the steep mountain that Da Lat sits on. The roads were in pretty reasonable shape on the way down — nothing too remarkable.

We got to the bottom of the mountain and ended up on this really very nice, properly divided 4 lane highway. It felt rather out of place given everything else we’d seen in the country so far. Even stranger, it soon divided into “cars only” and “motorbikes only” roads (two completely separate roads). Normally we’re actually going faster than the cars on the road (at a whopping 90KM/h), so it might’ve made practical sense to go on the car highway, but given that it’s the last day we didn’t want to get into any unnecessary trouble, so we ended up in the bike lane.

Shortly after, about 20KM into the day, we stopped for gas. I was looking at the map at the gas station and realized that the other end of the “car boulevard” was actually the Da Lat airport. It all makes sense now! The wealthy folk land at the airport and get a basically traffic-free ride straight into one of the nicest towns in the country.

Thankfully neither bike was short on oil and we quickly got back on the road. The highways were pretty typical fare — a mix of rural open highway without much traffic, and towns at regular intervals with tons of traffic (read: dodging semi-trucks coming straight for you and forcing us into the dirt, which we got very good at riding in). South Vietnam definitely feels denser than the north — the distance between towns felt smaller and each town felt larger, though the towns were still pretty typical fare — mechanics, Pho restaurants, tiny stores selling the same soda/tea, and unlabeled shanty buildings.

At about 140KM, around 1PM, we stopped for lunch at — you guessed it — a random tiny Pho place. This place was covered in dogs. At least 10 dogs, most of which looked to be puppies, were running around. None of them were particularly clean looking or well groomed. Aside from moaning about the heat and how sore our butts were, we debated whether these dogs were pets or food. We didn’t have the heart to use the translator to ask, so it’ll remain a mystery for now. By 1:30 we were back on the road with what Google said was 148KM to go.


Somewhere in this last stint we came upon stopped traffic. The road was completely blocked. Half of the road was blocked by a giant flatbed truck with big concrete cylinders on it (presumably some kind of municipal plumbing project), and the other half by a truck whose mirror couldn’t clear the concrete tubes on the flatbed. For reasons we still don’t understand, the truck couldn’t simply back up, turn the wheel left, and move over 6 inches to clear the flatbed. Instead he just sat there, and all of traffic waited and watched the flatbed’s crane move each of the cylinders off the flatbed and onto the ground so the truck could pass. What insanity!

At about 50KM outside the city we reached a fork in the road. Things were starting to feel a bit denser, and it felt as if we were on the outskirts of the city, which was really an incredible feeling. The map took us a way we didn’t expect, which turned out to be really nice. For about 25KM we had nothing but straight, 4 lane highway with proper dividers (so nobody turns into you) and were able to do a solid 100KM/h without the constant threat of kamikaze-motorbiker or passing-in-oncoming-traffic-18-wheeler-going-to-turn-you-into-pancake.

After the nice highway we knew we were in Saigon. How? Because there was traffic. Not like many-motorbikes-in-my-way traffic; like bumper-to-bumper-cars-going-slowly-on-a-highway traffic. First time we’d seen that on the journey! It was slow going, and we were forced to breathe in a lot of diesel fumes, but we eventually made it to the hotel in Ho Chi Minh City. WAHOO!

We dropped all of our stuff in the room and headed back out on the bikes for the last time. We drove into the heart of the city and got to see up close how different Saigon is from the rest of Vietnam. My first reaction was “this isn’t Vietnam, there’s Fancy Stuff here!” Mercedes, BMWs, Rolls Royces, Starbucks! Oh my.

We had some difficulty finding where to drop off the bikes as Google’s pin for the address was (fairly typically for this part of the world) about 8 blocks off. We did however find it eventually! They did a quick inspection of the bikes and discovered Felix’s broken rear suspension when trying to move the bike by hand. They picked up the rear of the bike and bits of the shock absorber literally fell out. That’s not supposed to happen!

A quick call to Mr. Hung and we got the final bill. Cost for the new wheel and the bike’s suspension: $96. Not too bad, all in all. We settled everything, got our deposits back and sat in a taxi back to the hotel. Oh what a feeling to be on four wheels and have air conditioning again! We took a quick dip in the pool to begin the de-stinkification process, followed by showers (long ones, with lots of soap and scrubbing). We got a few restaurant recommendations from friends and headed over to “District One” via taxi. The first restaurant we wanted to go to (Saffron) was actually so full and busy that they told us “sold out for the night!” That’s a new thing.

We tried the sister restaurant (Italian), which only had a 10 minute wait. Still unusual, but we didn’t mind waiting 10 minutes. I’ll spare you most of the dining details; suffice it to say the food was delicious and up there as one of the best Italian restaurants we’ve ever been to, including a giant 2 foot wheel of parmesan that had its top layer scraped off and then Felix’s pasta literally coated in the fresh stuff. Total bill was $35 a person, including appetizers and wine. Expensive for Vietnam, but very cheap for the quality anywhere else in the world!

Made it back to the hotel and we were both promptly ready to pass out after a long journey!

Stay tuned for the epilogue, including some data about our trip and retrospectives!


Vietnam Bike Tour: Hanoi to Ho Chi Minh City (Day 8)

DAY: 08
Origin: Nha Trang
Destination: Dalat
Distance: 140km
Pho Consumed: 2
Today’s Author: Felix

We woke early for our diving trip, heading downstairs to the dive center around 7am (convenient, right?).  Since the bus wasn’t scheduled to depart until 7:30, we headed across the street to a restaurant with Russian signage, which seemed to be the only place open on our street. We had a decent western-style breakfast, but unfortunately had one of our only poor service experiences in Vietnam; the food took nearly 30 minutes to appear, even though we had simply ordered some coffee and toast.  After trying (and apparently failing) to impress upon the staff our need for expediency, we finally made it back to the dive center around 7:45.

Awaiting us were our companions for the day, a handful of French tourists and Natalie, a Malaysian-Canadian girl on the second day of a dive certification class.  Fortunately everyone seemed cool with our tardiness, and we all piled into a minibus and headed for the dock.

Arriving at the dock, we hopped on our boat and met the rest of the dive crew as we headed away from shore.  Our divemaster for the day, Nguyen, was a chill dude who spoke pretty good English, and we felt that we were in good hands.  During the 40 minute ride to the dive site, we got a refresher on some basic scuba skills since Felix hadn’t dived in a number of years and Zach isn’t certified.

Arriving just off the coast of a small island, we were pleased to see that the water was almost perfectly clear.  After getting geared up, we followed Nguyen and hopped into the water.  We spent the first 10 minutes or so reviewing our basic scuba skills (hand signals, mask clearing, buoyancy control, etc.) and headed off after Nguyen to explore the reefs.  We were also accompanied by a cameraman who did a bang-up job of capturing our adventures.

While it was perhaps not the most amazing diving in the world, the reefs were teeming with life.  Although we stayed in relatively shallow water (~8-10m), we saw a host of incredibly colorful smaller fish, as well as a plethora of anemones and coral.  We also played with a giant jellyfish (!), which was apparently safe to touch as long as you stayed away from the tentacles on the bottom.

After about 50 minutes, we headed back to the boat and enjoyed a light snack of mangoes and baguettes.  Felix also sampled a proffered Vietnamese cigarette – apparently the “White Horse” brand is produced locally in Nha Trang.  We then geared back up and headed down for another dive.

On our second dive, we went a bit deeper – perhaps 12 or 15 meters.  For the most part, the marine life we saw was pretty similar to our first dive. We did briefly glimpse a school of larger fish about a dozen meters away, but they swam off fairly quickly.

Returning to the boat once more, we hung out for a bit as we waited for the rest of our group to return before we headed back towards the city.

Back on land once more, we piled back into our minibus and headed over to a small cafe for lunch, which was included in the price of our trip for the day.  We enjoyed some pretty tasty noodles accompanied by mystery meat, but we trusted our French tour leader and chowed down (apparently the cafe is owned by his Vietnamese wife).  The noodles were followed by crepes slathered in mango sauce for dessert, which were delicious if a bit too saucy.  We also chatted with Natalie about her traveling adventures throughout Asia, which was a nice change from our usual mealtime discussions of tech, business, and motorcycle-related soreness.

We headed back to the dive center / guesthouse, took quick showers, and packed up our bikes.  After settling our bill and getting directions to a Honda mechanic, we set off.  Unfortunately, it turns out that our bikes (XR150s) weren’t sold by Honda in Vietnam, and the dealership was thus very confused about our request for maintenance services.  We got directions to another garage in the more “Vietnamese” area of the city and headed off in that direction, but were unable to sufficiently explain our issue to the mechanic (Felix’s headlight was exceptionally dim).  Giving up, we gassed up and left the city around 4.

Heading towards Dalat, we were greeted by a new road in relatively good condition, as well as some very threatening stormclouds in the distance.  However, we got incredibly lucky and had some of the best riding of the trip; there was almost no traffic, the roads were both excitingly twisty and in pretty good shape, the scenery was incredible, and the temperatures dropped rapidly to a pleasantly cool level.

We passed a number of natural waterfalls, as the storm had just passed and there was plenty of water on the ground and in the hills.  Rapidly gaining elevation, we were treated to some incredible vistas over the mountains as the sun was beginning to set.

After a couple of hours of really outstanding riding, dark began to fall and we started to get a bit chilly.  On our way into the city of Dalat, we passed a dozen or so kilometers of lit up greenhouses, which made for some very pretty scenery.  We also began to notice large, American-style McMansion type homes; Dalat is clearly a city favored by wealthier Vietnamese.  This became more apparent as we entered the city proper; compared to much of the rest of Vietnam, the city is relatively well-kept and well-developed.  Near the center of town is a charming small lake, and nearby is a massive golf club.

We arrived at a small hotel recommended by Lonely Planet, the Dreams Hotel, and were pleasantly surprised to find that the elderly couple who run the place were both very friendly and spoke decent English.  We were also shocked to hear their adorable grandchildren running around the lobby speaking perfect English without a hint of accents whatsoever.

After checking in to the room and availing ourselves of the modern shower facilities, we headed down the street to grab some dinner.  We quickly stumbled upon a pizzeria that seemed promising and enjoyed some surprisingly decent pies (“Four Cheese,” “Mexican,” and Margherita).

Following dinner, we decided to wander around a bit.  After passing a number of bustling restaurants, bars, and shops, we happened upon a fancy-looking bakery.  As we hadn’t gotten Zach any cake for his birthday nor had dessert yet, we popped in and, after a quick glance at the prices, started piling a tray with refined carbohydrates (plus two party hats, because why not?).  Total bill for our sugary indulgences?  About $5.  What an awesome country!

Heading back to the hotel, we dove into our haul.  Zach put his friend Leona on video chat so she could see us in our party hats stuffing our faces with cake and sweets in the middle of Vietnam.  After polishing off most of our dessert, we headed to bed after what was clearly one of the best days of the trip.

Vietnam Bike Tour: Hanoi to Ho Chi Minh City (Day 7)

Day: 07
Distance: 287km
Origin: Binh Duong
Destination: Nha Trang
Pho Consumed: 1
Today’s Author: Felix


Somehow missing our alarm, we still managed to be up by 8.  In the interest of expediency, we had the rest of our emergency-ration faux-moon-pies for breakfast.  Leaving the hotel at 9:55, we gassed up and did a quick maintenance check on the bikes before we hit the road.  The road along this stretch was in pretty good condition, and we passed a few gigantic golden buddhas and temples as well as some very picturesque beaches.  We covered about 130km before stopping for lunch at a small shop in a seaside town and grabbing a bowl of Pho.

Heading back onto the road, we passed more beaches and reached Nha Trang around 4:15 in the afternoon.  On our way into the city, we passed through the non-touristy part of town, went past a gigantic temple of some sort and over a nice bridge surrounded by colorful fishing boats.
Video 1:
Video 2:
Nha Trang is much more developed than anywhere else we’ve been in Vietnam, with high-rise luxury hotels lining the pristine beach.  The streets are crowded with Westerners frequenting the many hotels, bars, restaurants, and so forth that are all designed to cater to them.  One thing we were surprised by was the prevalence of Russian signage – something we haven’t seen anywhere else in the country.  We later learned that the city is a major vacation destination for Russian tourists, with as many as 10 direct flights from Moscow arriving daily.

We checked in at Angel Dive Center and Guest House and made arrangements for our scuba outing the following day.  After cleaning ourselves up a bit, we headed down the block to a high end beach bar and restaurant for some birthday drinks for Zach.  The prices were ludicrous, especially for Vietnam, but the setting was incredible and the drinks were delicious.

Heading inland a couple of blocks, we had dinner at a nice Vietnamese place called Lanterns which was very highly rated on Tripadvisor and consequently packed with tourists.  To one side of us were a pair of very creepy Russian gentlemen (one older and one younger), and to the other was a large group of Korean sailors on shore leave.  The food was pretty good, and we hit up an extremely legit gelato shop for dessert – doing the tourist thing certainly has its advantages!