Kubernetes

Site Reliability Engineering Manager (GCP)

Role Overview
We are looking for an experienced Site Reliability Engineering (SRE) Manager with 8+ years of experience to lead a team of highly skilled SREs in managing, automating, and optimizing our cloud infrastructure on Google Cloud Platform (GCP). The SRE Manager will be responsible for ensuring the reliability, availability, and performance of critical services while driving automation and operational excellence.
As an SRE Manager, you will work closely with development, infrastructure, and security teams to implement scalable, resilient, and high-performance solutions. This role is ideal for someone passionate about reliability engineering, cloud automation, and observability.
Key Responsibilities:

Leadership & Team Management
• Lead, mentor, and grow a team of Site Reliability Engineers, fostering a culture of innovation, collaboration, and continuous learning.
• Define and drive SRE best practices, focusing on reliability, automation, monitoring, and incident response.
• Collaborate with development, DevOps, and security teams to align infrastructure and application reliability with business objectives.
• Own the SRE roadmap and strategy, ensuring alignment with organizational goals and industry best practices.
Reliability & Performance
• Ensure the uptime, availability, and performance of critical applications hosted on GCP.
• Implement SLOs (Service Level Objectives), SLIs (Service Level Indicators), and SLAs (Service Level Agreements) to measure system reliability.
• Conduct root cause analysis (RCA) for production incidents and drive post-mortems to improve system resilience.
Automation & CI/CD
• Automate infrastructure management using Infrastructure-as-Code (IaC) tools such as Terraform or Pulumi.
• Improve CI/CD pipelines using GitOps methodologies to enable faster, more reliable deployments.
• Champion self-healing architectures to minimize manual intervention.
Observability & Incident Management
• Implement and enhance monitoring, logging, and alerting using tools like Prometheus, Grafana, Stackdriver (Cloud Monitoring), and OpenTelemetry.
• Develop on-call rotations, runbooks, and incident management processes to minimize downtime and improve MTTR (Mean Time to Resolution).
• Use AI/ML-based anomaly detection for proactive monitoring.
Security & Compliance
• Ensure security best practices for IAM, networking, and data encryption within GCP.
• Conduct security audits and work with compliance teams to ensure adherence to SOC 2, ISO 27001, HIPAA, or other regulatory frameworks.
• Implement zero-trust security models and automated compliance policies.
Cost Optimization & Capacity Planning
• Optimize cloud costs using GCP cost management tools, rightsizing, and auto-scaling.
• Implement capacity planning strategies to balance cost and performance.
• Work with finance teams to forecast infrastructure costs and optimize spend.
Required Skills & Qualifications:

Technical Skills
• Strong expertise in Google Cloud Platform (GCP) services such as GKE, Cloud Run, Cloud Functions, Cloud SQL, BigQuery, and Cloud Spanner.
• Hands-on experience with Terraform, Pulumi, or Cloud Deployment Manager for Infrastructure-as-Code (IaC).
• Experience with CI/CD tools like GitHub Actions, ArgoCD, Spinnaker, or Jenkins.
• Strong knowledge of Kubernetes (GKE) and container orchestration.
• Experience with SRE principles such as error budgets, chaos engineering, and observability.
• Strong scripting and automation skills in Python.
• Experience with monitoring and observability tools (Stackdriver, Datadog, Prometheus, Grafana, New Relic).
Leadership & Soft Skills
• Proven experience managing and mentoring SRE teams.
• Strong problem-solving skills with the ability to troubleshoot complex production issues.
• Ability to work in a fast-paced, DevOps-oriented environment.
• Strong communication and stakeholder management skills.
• Experience collaborating with cross-functional teams, including engineering, security, and product teams.
Preferred Qualifications

• Google Cloud Professional Cloud Architect or Professional Cloud DevOps Engineer certification.
• Experience with multi-cloud or hybrid cloud environments.
• Hands-on experience with serverless computing and event-driven architectures.
• Prior experience in high-traffic, distributed systems.

Principal Analyst – MLOps Engineer

Role Overview
We are seeking a highly skilled Senior MLOps Engineer with 8+ years of experience to join our team. The ideal candidate will have extensive expertise in model deployment, model monitoring, and productionizing machine learning models. You will play a crucial role in designing and implementing efficient workflows for AI programming and team communication, ensuring seamless integration of ML solutions within our organization.
Key Responsibilities:
• Workflow Design & Implementation: Oversee the implementation of workflows for AI programming and team communication, ensuring optimal collaboration and efficiency.
• Model Deployment: Manage and optimize model deployment processes, including the use of Kubernetes for containerized model deployment and orchestration.
• Model Registry Management: Maintain and manage a model registry to track versions and ensure smooth transitions from development to production.
• CI/CD Implementation: Develop and implement Continuous Integration/Continuous Deployment (CI/CD) pipelines for model training, testing, and deployment, ensuring high code quality through rigorous model code reviews.
• Model Monitoring & Optimization: Design and implement model inference pipelines and monitoring frameworks to support thousands of models across various pods, optimizing execution times and resource usage.
• Team Leadership & Training: Manage, mentor, and train junior engineers, fostering their growth and learning while overseeing a large team.
• Collaboration with Data Science Teams: Train and collaborate with data science team members on best practices in tools such as Kubeflow, Jenkins, Docker, and Kubernetes to ensure smooth model productionization.
• Reusable Frameworks Development: Draft designs and apply reusable frameworks for drift detection, live inference, and API integration.
• Cost Optimization Initiatives: Propose and implement strategies to reduce operational costs, including optimizing models for resource efficiency, with the aim of delivering significant annual savings.
• Documentation & Standards Development: Produce MLE standards documents to assist data science teams in deploying their models effectively and consistently.
Qualifications:
• 8+ years of experience in MLOps, model deployment, and productionizing machine learning models.
• Proficient in Kubernetes, model monitoring, and CI/CD practices. Experience working in the Azure environment.
• Strong understanding of model registry concepts and best practices.
• Experience with programming languages and ML frameworks (e.g., TensorFlow, PyTorch).
• Proven track record of optimizing ML workflows and processes.
• Excellent communication and leadership skills, with experience in mentoring and training team members.
• Ability to work in a fast-paced, collaborative environment.

Senior ML Engineer

Factspan Overview:
Factspan is a pure-play data and analytics services organization. We partner with Fortune 500 enterprises to build an analytics centre of excellence, generating insights and solutions from raw data to solve business challenges, make strategic recommendations, and implement new processes that help them succeed. With offices in Seattle, Washington, and Bangalore, India, we use a global delivery model to service our customers. Our customers include industry leaders from the Retail, Financial Services, Hospitality, and Technology sectors.

Responsibilities

➢ Selecting features and building and optimizing classifiers and regression models using machine learning and deep learning techniques.
➢ Using data analytics tools to run queries and analyses and to define and correlate data, and using data visualization platforms to organize and present summaries, predictive analyses, comparative analyses, dashboards, and reports.
➢ Processing, cleansing, and verifying the integrity of data used for analysis.
➢ Performing data mining and analytics to support continuous risk monitoring and risk assessments of operational data: recognizing patterns and trends, investigating anomalies, and assessing the internal control environment.
➢ Applying statistical techniques and predictive modeling to identify indicators of risk.
➢ Driving efficiency by automating manual processes.

More responsibilities in detail:
➢ Excellent understanding of machine learning algorithms, such as Random Forest, Gradient Boosting, Naive Bayes, SVM, and KNN. Good understanding of deep learning algorithms, such as DNN, CNN, RNN, LSTM, and Autoencoders.
➢ Deep knowledge of ML/AI software and packages in Python (scikit-learn, TensorFlow, PyTorch) and R (caret).
➢ Proficiency in statistics concepts: sampling theory, descriptive statistics, probability distributions, statistical tests, dimensionality reduction, hypothesis testing, maximum likelihood estimators, inference, etc.
➢ Expertise in model validation, hyperparameter tuning, and model selection techniques such as cross-validation, leave-one-out, and bootstrap.
➢ Proficiency in using query languages and engines such as SQL and Spark.
➢ Services, Reporting Service, Power BI, Python, PySpark (distributed computing), machine learning, time series, data mining, mathematical modeling, probability, and stochastic processes.
