Data Architecture and Design:
- Collaborate with cross-functional teams to understand data requirements and design efficient and scalable data architectures.
- Develop and maintain data models, schema designs, and data flow diagrams.
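For illustration only (not part of the formal requirements), the schema-design work above typically looks like the following minimal PySpark sketch; the table, field names, and types are assumptions, not details from this posting.

```python
# Illustrative only: a hypothetical "orders" schema definition in PySpark.
from pyspark.sql.types import (
    StructType, StructField, StringType, DecimalType, TimestampType
)

orders_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer_id", StringType(), nullable=False),
    StructField("order_total", DecimalType(12, 2), nullable=True),
    StructField("created_at", TimestampType(), nullable=True),
])
```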
ETL Development:
- Design, develop, and optimize Extract, Transform, Load (ETL) processes using Python and PySpark.
- Implement robust data pipelines for efficient data extraction, transformation, and loading from various sources to data warehouses.
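As a rough sketch of the kind of ETL work described above (paths, column names, and the partition key are hypothetical placeholders):

```python
# Minimal ETL sketch: extract from CSV, transform, load to Parquet.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_etl").getOrCreate()

# Extract: read raw orders from a landing zone (placeholder path).
raw = spark.read.option("header", True).csv("s3://landing/orders/")

# Transform: cast types and derive a partition column.
orders = (
    raw.withColumn("order_total", F.col("order_total").cast("decimal(12,2)"))
       .withColumn("order_date", F.to_date("created_at"))
)

# Load: write partitioned Parquet into the warehouse layer (placeholder path).
orders.write.mode("overwrite").partitionBy("order_date").parquet("s3://warehouse/orders/")
```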
Data Processing and Transformation:
- Leverage PySpark for large-scale data processing, ensuring high performance and reliability.
- Implement data transformations, aggregations, and cleansing procedures to maintain data quality.
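By way of example, cleansing and aggregation steps of this kind might look like the sketch below; the dataset and column names are assumed for illustration.

```python
# Sketch of cleansing and aggregation in PySpark; names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_cleanse").getOrCreate()
orders = spark.read.parquet("s3://warehouse/orders/")  # hypothetical path

cleansed = (
    orders.dropDuplicates(["order_id"])                             # remove duplicate rows
          .filter(F.col("order_total").isNotNull())                 # drop rows missing the amount
          .withColumn("customer_id", F.trim(F.col("customer_id")))  # normalize whitespace
)

# Aggregate: daily revenue per customer.
daily_revenue = (
    cleansed.groupBy("customer_id", "order_date")
            .agg(F.sum("order_total").alias("daily_revenue"))
)
```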
Data Integration:
- Integrate data from various sources, including structured and unstructured data, ensuring consistency and accuracy.
- Work closely with data scientists and analysts to understand their data needs and support the integration of data into their analytical models.
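A minimal sketch of integrating a structured source with semi-structured data, assuming a hypothetical JDBC customer table and JSON clickstream events (connection details are placeholders):

```python
# Sketch of integrating two hypothetical sources: a JDBC table and JSON events.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("integration").getOrCreate()

customers = (
    spark.read.format("jdbc")
         .option("url", "jdbc:postgresql://host:5432/crm")  # placeholder connection
         .option("dbtable", "public.customers")
         .option("user", "etl_user")
         .option("password", "***")
         .load()
)
events = spark.read.json("s3://landing/clickstream/")       # semi-structured source

# Conform on a shared key so downstream consumers see one consistent view.
joined = events.join(
    customers.select("customer_id", "segment"),
    on="customer_id",
    how="left",
)
```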
Performance Optimization:
- Monitor and optimize data processing and ETL jobs for performance, scalability, and efficiency.
- Troubleshoot and resolve issues related to data pipeline performance.
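Typical tuning levers for this kind of work are sketched below (cluster settings, paths, and table sizes are assumptions for illustration):

```python
# Sketch of common PySpark tuning levers: shuffle partitions, caching, broadcast joins.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("tuning")
    .config("spark.sql.shuffle.partitions", "400")  # size shuffles to the cluster
    .getOrCreate()
)

facts = spark.read.parquet("s3://warehouse/orders/")     # hypothetical fact table
dims = spark.read.parquet("s3://warehouse/customers/")   # hypothetical small dimension

facts = facts.repartition("order_date").cache()          # reused by several downstream steps

# Broadcast the small dimension table to avoid a shuffle join.
enriched = facts.join(F.broadcast(dims), on="customer_id", how="left")
```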
Data Quality and Governance:
- Implement data quality checks and validation processes to ensure the accuracy and reliability of the data.
- Enforce data governance policies and best practices.
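For illustration, simple data quality gates of this kind can be expressed as checks that fail the job before bad data is published; the rules and column names below are assumed examples.

```python
# Sketch of simple data quality checks before publishing a table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq_checks").getOrCreate()
orders = spark.read.parquet("s3://warehouse/orders/")  # hypothetical path

total = orders.count()
null_keys = orders.filter(F.col("order_id").isNull()).count()
dup_keys = total - orders.dropDuplicates(["order_id"]).count()
negative_totals = orders.filter(F.col("order_total") < 0).count()

# Fail loudly rather than loading bad data downstream.
if null_keys or dup_keys or negative_totals:
    raise ValueError(
        f"Data quality check failed: {null_keys} null keys, "
        f"{dup_keys} duplicate keys, {negative_totals} negative totals"
    )
```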
Collaboration and Documentation:
- Collaborate with cross-functional teams including data scientists, analysts, and business stakeholders to understand and address their data requirements.
- Document data engineering processes, ETL workflows, and data architectures.
Technology Research and Adoption:
- Stay abreast of the latest trends and advancements in data engineering and recommend the adoption of new technologies and tools to enhance efficiency.
Qualifications:
- Minimum of 7 years of hands-on experience in data engineering, with a focus on Python and PySpark.
- Proven experience in designing and implementing scalable and efficient ETL processes.
- Strong knowledge of data modeling, data warehousing concepts, and database systems.
- Experience with big data technologies and distributed computing, with a focus on PySpark.
- Proficiency in working with cloud platforms such as AWS, Azure, or Google Cloud.
- Strong problem-solving and troubleshooting skills.