Key Responsibilities:
- ETL Development: Develop, optimize, and maintain ETL workflows for ingesting, processing, and transforming data from a variety of sources.
- Data Pipeline Design: Develop scalable, high-performance data pipelines using PySpark and Python for batch and real-time data processing. Perform data quality checks and prepare data marts for business applications such as Tableau Control Tower and Campaigns (see the illustrative sketch after this list).
- Data Integration: Extract and integrate data from multiple sources such as SQL Server, PostgreSQL, AWS Redshift, and Cloudera.
- Data Quality: Ensure data quality and consistency by building and implementing validation, cleansing, and transformation logic.
- Collaboration: Work closely with data scientists, business analysts, and other stakeholders to understand data requirements and support data-driven decision-making.
- Data Modeling: Assist in designing and developing database tables and schemas, including optimizing the performance of relational databases such as MySQL and PostgreSQL as well as NoSQL databases (e.g., MongoDB, Cassandra).
- Troubleshooting & Optimization: Identify and resolve performance bottlenecks, data discrepancies, and pipeline failures.
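To give a flavor of the pipeline and data quality work described above, the following is a minimal, illustrative PySpark sketch of a batch ETL step: extract from a relational source, apply a simple validation filter, and write an aggregated data mart. All connection details, table names, columns, and paths are hypothetical and not references to any actual system used by this team.

```python
# Illustrative sketch only: a minimal PySpark batch ETL step with a simple
# data quality check. All names, credentials, and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("campaign_data_mart").getOrCreate()

# Extract: read a hypothetical transactions table over JDBC.
transactions = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://example-host:5432/sales")  # hypothetical source
    .option("dbtable", "public.transactions")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Data quality check: drop rows with missing keys or negative amounts.
valid = transactions.filter(
    F.col("customer_id").isNotNull() & (F.col("amount") >= 0)
)
rejected_count = transactions.count() - valid.count()
print(f"Rejected {rejected_count} rows failing validation")

# Transform and load: aggregate into a daily data mart for reporting tools.
daily_mart = (
    valid.groupBy("customer_id", F.to_date("transaction_ts").alias("txn_date"))
    .agg(F.sum("amount").alias("total_amount"), F.count("*").alias("txn_count"))
)
daily_mart.write.mode("overwrite").parquet("/warehouse/marts/daily_sales")  # hypothetical path
```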
Required Qualifications:
- Bachelor’s degree in Computer Science, Data Science, Information Systems, or a related field.
- 4 years of experience in ETL development.
- Strong proficiency in SAS programming, Spark, SQL, and Python.
- Extensive experience with Apache Spark for big data processing, both batch and streaming (see the streaming sketch after this list).
- Proficient in working with relational databases (SQL Server, PostgreSQL, MySQL) and non-relational databases (MongoDB, Cassandra).
- Solid understanding of data structures, database design, and data modeling.
- Experience in the banking/financial industry.
- Chinese language skills are an advantage, as this role involves working with teams in China.
- Strong problem-solving and troubleshooting skills.
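To illustrate the streaming side of the Spark qualification above, here is a minimal Structured Streaming sketch that reads from a hypothetical Kafka topic and writes windowed event counts to a file sink. Broker addresses, topic names, and paths are assumptions for illustration only.

```python
# Illustrative sketch only: a minimal Spark Structured Streaming job.
# Brokers, topic, and output paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("events_stream").getOrCreate()

# Read a stream of events from a hypothetical Kafka topic.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
    .option("subscribe", "campaign_events")             # hypothetical topic
    .load()
)

# Kafka delivers the payload as binary; cast it to string, then count
# events per one-minute window with a watermark for late data.
counts = (
    events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .withWatermark("timestamp", "5 minutes")
    .groupBy(F.window("timestamp", "1 minute"))
    .count()
)

# Write windowed counts to a file sink with checkpointing.
query = (
    counts.writeStream.outputMode("append")
    .format("parquet")
    .option("path", "/warehouse/streams/event_counts")        # hypothetical path
    .option("checkpointLocation", "/warehouse/checkpoints/event_counts")
    .start()
)
query.awaitTermination()
```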