System Reliability Engineer
The System Reliability Engineering team is responsible for the availability, performance and security of our products, their production and development environments, and all the services that support them. Team members work closely with one another and with colleagues from other cloud services teams around the world. The three most important characteristics of this team are open communication, teamwork and technical excellence.
The Cluj SRE team is composed of 8 members with different levels of seniority.
We are strongly oriented toward delivering the highest possible availability and low response times for our websites and services. To that end, we have implemented rigorous monitoring and alerting platforms and run a 24/7 on-call rotation. We also strive to automate most of our work and thus eliminate everyday toil. To achieve all this, we often experiment with cutting-edge tools and concepts and integrate new technologies or ideas into our existing infrastructure.
This role is not available as a permanently remote position. To continue in the recruitment process, you must be able to work partly from the Cluj office for the foreseeable future.
What we're looking for:
- 2+ years of experience with Unix/Linux and/or Windows operating systems in a production environment;
- 2+ years of experience maintaining and troubleshooting multiple application servers and web servers in a production environment;
- Medium to advanced knowledge of multiple virtualization technologies such as: VMware, Hyper-V;
- Basic to advanced knowledge of cloud computing platforms such as: GCP, Microsoft Azure;
- Basic to advanced knowledge of container orchestration tools such as: Kubernetes;
- Basic to advanced knowledge of one or more configuration management systems: Puppet, Ansible, Vagrant;
- Medium to advanced knowledge of one or more monitoring and alerting systems such as: Icinga2, Graphite, collectd, Prometheus/AlertManager and ELK;
- Basic to advanced knowledge of maintaining relational database systems such as: MySQL, PostgreSQL, Microsoft SQL Server, Oracle;
- Medium to advanced knowledge of one or more scripting languages such as: Bash, Perl, Python, PowerShell, Ruby;
- Basic to advanced knowledge of one or more programming languages, and an understanding of how to approach both compiled and interpreted implementations;
- Basic to advanced knowledge of continuous integration or continuous delivery systems: Jenkins, GoCD, TeamCity, GitLab;
- Basic to advanced knowledge of versioning systems such as: SVN, Git, TFS.
What you'll do:
- Keep production services up;
- Implement and maintain fault tolerance for all services;
- Optimize performance at all possible layers;
- Actively participate in daily standups and the other meetings you are part of;
- Perform maintenance as needed in production and development environments;
- Regularly analyze graphs and other monitoring data and take action based on that analysis;
- Respond to alerts promptly;
- Implement and maintain monitoring and alerting tools required to keep services highly available and performant;
- Automate and integrate configuration management wherever possible;
- Implement and maintain the necessary backup systems to ensure recovery;
- Collaborate openly and effectively with all teams;
- Research and propose new ways of improving our technology stack and workflow;
- Participate in the on-call rotation as agreed with team members and your manager, and follow the necessary procedures while on call.
We also offer a competitive benefits package including:
- Highly competitive salary;
- Educational reimbursement plan of up to $2,000 per calendar year;
- Comprehensive private medical and life insurance plan;
- Flexible work schedule;
- Modern technology, work methods and tools;
- Opportunity to participate in various training programs and conferences;
- 20 RON/day meal tickets;
- Access to company vacation condos in the U.S.;
- Frequent internal engagement activities;
- Discount plans for fitness centers, restaurants and various service providers.
You will also be able to use the benefits below, once we return to our office:
- Commuting and parking benefits;
- Fresh fruit, snacks and coffee in the office;
- Open-air terraces and game rooms;
- Regular themed parties & team building activities.