📚 A collection of books, research papers, videos, and articles for mastering Site Reliability Engineering.
- Site Reliability Engineering: How Google Runs Production Systems
- Site Reliability Engineering: The Site Reliability Workbook
- Building Secure & Reliable Systems
- Docker: Up and Running
- Kubernetes: Up and Running, by Brendan Burns, Kelsey Hightower, and Joe Beda
- Microservices in Production
- Designing Data-Intensive Applications
- Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable Services - Free to download
- Software Engineering at Google - Free to download
- Modern Operating Systems, by Andrew S. Tanenbaum
- UNIX and Linux System Administration Handbook, by Evi Nemeth
- TCP/IP Illustrated, Volume 3: TCP for Transactions, HTTP, NNTP, and the UNIX Domain Protocols, by W. Richard Stevens
- Systems Performance: Enterprise and the Cloud
- The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
- The Practice of System and Network Administration
- The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems
- Linux Server Hacks: 100 Industrial-Strength Tips and Tools, by Rob Flickenger
- Web Operations: Keeping the Data On Time
- The Linux Command Line, by William E. Shotts Jr.
- Shell Scripting: How to Automate Command Line Tasks Using Bash Scripting and Shell Programming
- The Go Programming Language, by Alan A. A. Donovan
- Think Python, by Allen B. Downey
- Programming Pearls, by Jon L. Bentley
- Code Complete 2, by Steve McConnell
- Time Management for System Administrators
- Large-scale cluster management at Google with Borg
- On designing and deploying internet-scale services
- Mesos: a platform for fine-grained resource sharing in the data center
- Google: Reliable Cron across the Planet
- Kubernetes
- CNCF landscape
- Aurora
- Docker
- Fluentd
- Elasticsearch
- Hadoop
- Mesos
- Kernel-based Virtual Machine (KVM)
- Spark
- VMware
- Software engineering at Google
- Keys to SRE by Ben Treynor
- How Container Clusters Like Kubernetes Change Operations
- 10 Years of Crashing Google
- Release Engineering Best Practices at Google
- From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams
- Transactional System Administration Is Killing Us and Must be Stopped
- Lessons Learned From Scaling Uber To 2000 Engineers, 1000 Services, And 8000 Git Repositories
- Netflix: 190 Countries and 5 CORE SREs
- Performance Checklists for SREs
- Notes on SRE book
- SYSADMIN (Un)Reliability Budgets