Description
Job Purpose: The data engineer in the products team is responsible for expanding and optimizing the client's data and data pipeline architecture, as well as optimizing data flow and transformations.
Looking for data engineers bringing expertise in two or more of the following areas:
* Building reliable data pipelines using Spark / Python / R
* Provisioning data / providing access, building REST/similar APIs backed by technologies such as PostgreSQL/Elasticsearch/S3/…
* Clinical trial datasets (SDTM/ADaM/…)
* Biological datasets (e.g. omics; DNA/RNA)
Major Accountabilities:
* Design, create, test, and maintain optimal data pipeline architecture to ensure that it supports the requirements of the stakeholders
* Build the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources
* Develop data set processes for data modelling, mining, and production to deliver data
* Deliver clear, maintainable, and well-tested code in a timely manner
* Identify, design, and implement internal process improvements: automating manual processes, optimizing data delivery, re-designing infrastructure for greater scalability, etc.
* Create data tools/scripts for data curators and data scientists as needed
* Collaborate with scientists, data managers, and technology teams to document and leverage their understanding of the data
* Actively participate in agile work practices
* Adopt and improve on the strong engineering practices followed by technology teams
* Analyse analytical blueprints to identify technical gaps and build best practices
Ideal Background
Education (minimum/desirable):
MSc or PhD in a quantitative / computational science (e.g. bioinformatics, machine learning, statistics, physics, mathematics, ...)
Required technical and scientific skills:
* Software engineering experience (versioning, Scrum, testing, good collaboration practices)
* Some experience with Python for pipelines, engineering practices, … (e.g. Python + Spark; Dask, Snakemake, etc.)
Additional skills in two or more of the following areas are highly desirable:
* Strong experience with Python for pipelines, engineering practices (e.g. Python + Spark; Dask, Snakemake, etc.)
* Some experience with R for data analysis (SparkR/sparklyr, tidyverse, mlr, ...)
* Experience with computational environments for large-scale processing (high-performance computing and/or Spark)
* Analysis of clinical trial data, including an understanding of data formats and processes
Michael Bailey International is acting as an Employment Business in relation to this vacancy.