The following is an outline of some best practices for web apps. The first set of notes is from Jesper Andersen’s How to build stable systems, which led me to Adam Wiggins’ The Twelve-Factor App.
Notes from How to build stable systems – an incomplete opinionated guide
- Developer controls software; nobody else
- When bugs occur, fix them
- When deadlines approach, meet them
- When deadlines pass, deploy what works, roll back what doesn’t
- Small, controllable, deploy-able units of work
- do fewer things and do them well
- Any change to the software is succinct and moves it from one stable state to another
- have a plan to transition from one system to another
- a project has a known done state
- no more than 6 people on a project
- a project isn’t longer than 2.5 months
- a project starts with 24-72 hours (a seed) of concentrated work
- if a seed fails, abort the project
- a project has a single gamble only
- a project has a list of ‘things we don’t solve in this project’
- test core components more than leaf components
- run experiments before starting projects
- research the quality of code and data you interact with
- people can develop from any physical location at any time
- prefer asynchronous communication systems
- create hiding spots
- the system is built for production
- a flat set of modules with loose coupling
- each module has one responsibility, but isn’t microscopic
- use protocols, so any party can be changed
- reducing dependencies is probably more valuable than avoiding duplicate code
- in a communication chain, the endpoints have intelligence and the intermediaries just pass data along
- any opaque blob of data is accepted and passed on
- intermediaries don’t parse or interpret data
- have a supervision/restart strategy
- use ratcheting via idempotence
- from a known stable state, attempt the next computation; if it succeeds, verify stability; if it fails, try again
- stateless between computations, stateful on the ratchet flanks
- unique IDs on all messages with a log
- the system always catches up to a point in time, so there’s no difference between on-line and off-line processing
- define the target capacity up front
- SETUP
- first build an empty project
- add empty project to continuous integration
- deploy empty project to staging
- if no users, deploy empty project to production
- when this works, start building application
- continuous integration produces the artifact (built, self-contained code with no reliance on host environment save simple setup)
- avoid external dependencies when deploying
- the same artifact is deployed to staging and production; it picks up environment context which configures it
- the config file can be on disk, consul, etcd, DNS, or downloaded from S3 (err for simplicity)
- the artifact is reproducible; lock dependencies to versions; vendor everything
- the artifact contains everything for running the software and has information for the deployment system
- try to make production deploy take <1 minute
- default library in every application contains
- debugging utilities
- tracing utilities
- tools to gather and export metrics
- ways for the app to become a bot in a chat network
- etc
- correctness, elegance, and quality are more important than fast
- fast is not really important
- define ‘good enough’ and don’t optimize beyond it
- measure before and after optimization
- build system to collect metrics about itself as it runs; ship to central point
- use HdrHistogram
- use any way you know to catch early bugs (tooling). unit test, property based test, type systems, static analysis, profiling, etc
- build to run on multiple environments, preferably UNIX
- code formatting discussions are stupid; just define what ends up working for you as the standard
- isolate the error kernel?
- use load regulation on the border, not inside
- prioritize service to a few if it can’t be for all
- use circuit breakers
- ed(1) is the default text editor
- postgresql is the default database (unless dataset >10TB)
- “If you need MongoDB-like functionality you create a jsonb column. You learn about your data while they are in postgres and move them out when you have learned. Postgres is your authoritative storage. You export to elasticsearch from postgres. You preheat your other data stores from postgres. Until load increases you will run read replicas of postgres. You will use pg_bouncer.”
- isolate complex transactional interactions to a few parts, usually having to do with money. try to use idempotent ratcheting instead
- use Erlang for robust, stable operation
- if you don’t pick Erlang, you’ll have to re-implement Erlang’s ideas in your language
- avoid writing everything in one language; use many languages to take advantage of each one’s strengths
- have a way to automatically deploy
- use GNU Make(?) to build everything
- the artifact comes with a default configuration and everything that needs to change is picked up from the context (environment variables)
- persistent data lives outside of the artifact path, on a dedicated disk with dedicated quota
- the app logs to a default location and the log never exceeds a set amount of space
- the artifact path is not writable by the application
- use different credentials in production and staging
- isolate the staging and production networks
- deny developers easy access to production environment
- avoid adopting etcd/Consul/Chubby early; prefer a file on S3 downloaded at boot until you need a fully dynamic configuration system
- optimize for sleep; avoid waking people up in the middle of the night at all costs
- gracefully degrade; a partially responsive system still “works”
- run in monit, supervise, upstart, systemd, rcNG, SMF, etc; never in tmux/screen.
- the OS gracefully restarts the app for a while before giving up
- consider a split stack of your own software away from the system stack
- it is always safe to kill the application
- app must gracefully stop and start
- booting off active requests is not an option
- stop internal parts in the opposite order they were started
- there should be enough information shipped that you can reconstruct the error without production access >90% of the time
- developers never log into production hosts
- every log file is shipped out of the system
- every interesting metric too
- developers work on staging hosts
- metrics often lead failures
- the only way to make changes to a production host and/or staging host is to redeploy
- if you want to deploy to production many times a day, then have a group of stable hosts on hand for a rollback
- Docker is not mature (Feb 2016) so avoid it in production
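The "ratcheting via idempotence" items in the list above (start from a known stable state, attempt the next computation, verify, retry on failure, and keep unique IDs on all messages with a log) can be sketched roughly as follows. This is only an illustration under assumptions of my own: the `Ratchet` class, the `is_stable` check, the `deposit` helper, and the `balance` key are hypothetical names that do not come from the original guide, and a real system would keep the state and the ID log in durable storage rather than in memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

State = Dict[str, int]


def is_stable(state: State) -> bool:
    """Hypothetical stability check: no value may go negative."""
    return all(value >= 0 for value in state.values())


@dataclass
class Ratchet:
    """Keeps the last known stable state plus a log of applied message IDs."""
    state: State = field(default_factory=dict)
    applied: Set[str] = field(default_factory=set)

    def step(self, msg_id: str, compute: Callable[[State], State],
             max_attempts: int = 3) -> bool:
        """Advance one notch; return True if msg_id is reflected in the state."""
        if msg_id in self.applied:
            return True  # replayed message: already applied, nothing to do
        for _ in range(max_attempts):
            try:
                candidate = compute(dict(self.state))  # compute on a copy
            except Exception:
                continue  # failed attempt: stable state untouched, try again
            if is_stable(candidate):
                self.state = candidate  # commit: the ratchet clicks forward
                self.applied.add(msg_id)
                return True
        return False  # gave up; still sitting on the previous stable state


def deposit(amount: int) -> Callable[[State], State]:
    """Build a computation that adds `amount` to a hypothetical 'balance' key."""
    def apply(state: State) -> State:
        state["balance"] = state.get("balance", 0) + amount
        return state
    return apply
```

Calling `ratchet.step("msg-1", deposit(10))` twice on the same `Ratchet` applies the deposit only once, because the second call finds the ID in the log. A computation that raises or produces an unstable state never replaces the committed state, so the system is always either at the old stable state or the new one.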
Notes from The Twelve-Factor App
- In the modern world software is commonly delivered as a service: called web apps, or software-as-a-service. Twelve Factor is for web apps that:
- use declarative formats for setup automation
- have a clean contract with the underlying OS
- are suitable for deployment on modern cloud platforms
- minimize divergence between development and production, enabling continuous deployment
- can scale up without significant changes to tooling, architecture, or development practices
- I Codebase
- one codebase tracked in revision control (Git, Mercurial, Subversion, etc), many deploys
- code repository/code repo/repo = a copy of the revision tracking database
- codebase = any single repo (if centralized) or any set of repos that share a root commit (if decentralized)
- distributed system = multiple codebases
- if multiple apps need the same code put it into a library
- a deploy = a running instance of the app (production, staging, each developer, etc)
- all deploys use same codebase, but they can have different versions active (commits not yet deployed to next level)
- II Dependencies
- explicitly declare and isolate dependencies
- never rely on implicit existence of system-wide packages
- dependency declaration manifest = declare all dependencies, completely and exactly
- dependency isolation tool = ensures nothing leaks in during execution
- declaration and isolation must be used together (Ruby has Gemfile and bundle exec, Python has Pip and Virtualenv, C has Autoconf and static linking)
- never rely on the implicit existence of system tools (shelling out to ImageMagick or curl)
- III Config
- store config in the environment
- config = everything likely to vary between deploys (not internal application config such as config/routes.rb in Rails because it doesn’t vary between deploys)
- resource handles to the database, Memcached, other backing services
- credentials to external services like Amazon S3, Twitter
- per-deploy values like the canonical hostname for the deploy
- do not store configs as constants in the code
- do not use config files
- do not batch config into named groups
- store config in environment variables/env var/env
- easy to change between deploys without changing code
- little chance of being checked into repo accidentally
- language and OS agnostic
- env var = granular controls, each fully orthogonal to others; never grouped
- IV Backing Services
- treat backing services as attached resources
- any service the app consumes over the network as part of normal operations
- datastores (MySQL, CouchDB)
- messaging/queueing (RabbitMQ, Beanstalkd)
- SMTP (Postfix)
- caching (Memcached)
- metrics (New Relic, Loggly)
- binary assets (Amazon S3)
- APIs (Twitter, Google Maps, Last.fm)
- etc
- make no distinction between local and third party services; access both via URL or other locator/credentials stored in the config
- resource = distinct backing service (two MySQL databases are two resources)
- attach and detach at will
- V Build, release, run
- strictly separate build and run stages
- a codebase is transformed into a (non-development) deploy through three stages
- build stage = convert code repo into executable bundle
- release stage = combines build and config, ready for execution
- run stage (runtime) = launches app’s processes against a selected release
- every release should have a unique release ID (timestamp, increment number)
- releases cannot be changed; any change must create a new release
- builds are initiated by developers, but runtime execution can be automatic (server reboot, process manager restart)
- VI Processes
- execute the app as one or more stateless processes
- processes = how the app executes in the execution environment
- simplest case: code is stand-alone script on dev’s laptop launched via command line; complex case: production deploy of sophisticated app with many process types
- processes are stateless and share nothing
- any data that needs to persist must be stored in a stateful backing service (database)
- the memory space or filesystem of the process can be used as a brief, single-transaction cache (download a large file, operate on it, store result in database)
- never assume anything cached in memory or on disk will be available on a future request or job
- do not use sticky sessions (session state data can go into a datastore with time-expiration like Memcached or Redis)
- VII Port binding
- export services via port binding
- the twelve-factor app is completely self contained and does not rely on runtime injection
- the web app exports HTTP as a service by binding to a port, and listening to requests coming in on that port
- in a local dev env, the dev visits a URL like http://localhost:5000/ and in deployment a routing layer routes requests from a public-facing hostname to the port-bound web processes
- typically add a webserver library to the app (Tornado for Python, Thin for Ruby, Jetty for Java)
- nearly any kind of server software can be run via a process binding to a port and awaiting incoming requests (not just HTTP)
- by using port binding, one app can become the backing service for another app
- VIII Concurrency
- scale out via the process model
- in many cases, the running process(es) are only minimally visible to the developers of the app
- in twelve-factor, processes are first-class citizens (like the unix model for service daemons)
- assign types of work to process types
- individual processes can internally multiplex (threads in the runtime VM, async/evented as in EventMachine, Twisted, Node.js)
- process formation = the array of process types and number of each process
- never daemonize or write PID files
- rely on OS’s process manager (Upstart, cloud platform, Foreman in development) to manage output streams, respond to crashed processes, handle user-initiated restarts and shutdowns
- IX Disposability
- maximize robustness with fast startup and graceful shutdown
- processes are disposable = started/stopped at a moment’s notice
- processes shut down gracefully when they receive SIGTERM from the process manager (stop listening, finish current requests, exit)
- for a worker process, return the job to the work queue (NACK in RabbitMQ, disconnect in Beanstalkd, release lock in Delayed Job)
- processes are also robust against sudden, non-graceful death (for example, via a queueing backend like Beanstalkd that returns jobs to the queue when clients disconnect, or a crash-only design)
- X Dev/prod parity
- keep development, staging, and production as similar as possible
- design for continuous deployment by keeping the gap between development and production small
- deploy in hours or minutes
- developers who write the code stay involved in deploying it and watching it in production
- keep environments similar
- do not use different backing services between development and production
- install Memcached, PostgreSQL, RabbitMQ using Homebrew, apt-get
- use Chef, Puppet and light-weight virtual environments like Vagrant
- the cost of installing and using these is low compared to the benefit of continuous deploy
- XI Logs
- treat logs as event streams
- logs = stream of aggregated, time-ordered events collected from the output streams of all running processes and backing services; typically text with one event per line, no beginning or end
- twelve-factor app never routes or stores output streams; don’t write to or manage log files
- each running process writes its event stream, unbuffered, to stdout
- in staging or production, the execution environment collates and archives all streams (Logplex, Fluent)
- event stream can be routed to a file, or watched realtime in terminal, or sent to log indexing/analysis (Splunk, Hadoop/Hive)
- XII Admin processes
- run admin/management tasks as one-off processes
- database migration
- console (REPL shell)
- scripts
- one-off admin processes run in an identical env, against a release, using the same codebase and config
- admin code must ship with application code to avoid synchronization issues
- use dependency isolation
- twelve-factor favors languages which provide a REPL shell out of the box and make it easy to run one-off scripts
- in development, use direct shell command inside the app’s checkout directory; in production, use ssh
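As one concrete illustration of the notes above, factor III (store config in the environment) might look something like this in Python. It is a minimal sketch, not the Twelve-Factor authors' code: the variable names `DATABASE_URL`, `PORT`, and `LOG_LEVEL`, the `Config` class, and the `load_config` function are assumptions chosen for the example, and the default port simply mirrors the `localhost:5000` URL mentioned under port binding.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Config:
    """Per-deploy settings; each field maps to one independent env var."""
    database_url: str
    port: int
    log_level: str


def load_config(env: Optional[Mapping[str, str]] = None) -> Config:
    """Read config from the environment (or from a mapping passed in tests)."""
    if env is None:
        env = os.environ
    if "DATABASE_URL" not in env:
        # Fail fast at boot instead of falling back to a constant in the code.
        raise RuntimeError("DATABASE_URL must be set in the environment")
    return Config(
        database_url=env["DATABASE_URL"],
        port=int(env.get("PORT", "5000")),      # assumed default for local dev
        log_level=env.get("LOG_LEVEL", "info"),  # assumed default
    )
```

The same artifact can then run in staging and production unchanged, with only the environment differing. Passing a plain dict to `load_config` keeps the function easy to test without touching the real process environment.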