Job Overview:
We are seeking a talented PySpark Developer with a strong background in distributed computing, data partitioning, and parallel programming to join our dynamic data engineering team. The ideal candidate will have hands-on experience working with Resilient Distributed Datasets (RDDs) and deep knowledge of scalable data processing frameworks.
Key Responsibilities:
· Design and implement scalable, distributed data processing pipelines using PySpark.
· Work with RDDs to efficiently manage large datasets across multiple nodes in a distributed cluster environment.
· Develop and optimize PySpark applications for data partitioning, parallel processing, and transformation.
· Collaborate with data scientists and analysts to support their data preparation and processing needs.
· Optimize the performance of Spark jobs by tuning partitioning, shuffling, and caching strategies (see the illustrative sketch after this list).
· Ensure fault-tolerant, resilient data pipelines using PySpark’s fault-tolerance mechanisms.
· Troubleshoot and resolve issues related to distributed data processing in cloud or on-premises environments.
· Contribute to the overall architecture and design of large-scale data systems.
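
By way of illustration, the sketch below shows the kind of partitioning, caching, and RDD transformation work this role involves. It is a minimal, hypothetical example: the input path, the three-field record format, and the partition counts are assumptions for demonstration only, not part of any actual pipeline here.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-partitioning-sketch").getOrCreate()
sc = spark.sparkContext

# Split the input into explicit partitions so work spreads across the cluster.
lines = sc.textFile("hdfs:///data/events/*.log", minPartitions=64)  # hypothetical path

# Transformations run in parallel, one task per partition.
parsed = (lines.map(lambda line: line.split(","))
               .filter(lambda fields: len(fields) == 3))  # assumed 3-field records

# Cache the parsed RDD because two downstream actions reuse it.
parsed.cache()

total = parsed.count()  # first action materializes the cache
counts_by_key = (parsed.map(lambda fields: (fields[0], 1))
                       .reduceByKey(lambda a, b: a + b, 32))  # shuffle into 32 partitions

print(total, counts_by_key.take(5))
spark.stop()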
Qualifications:
· Bachelor’s or Master’s degree in Computer Science, Data Engineering, or related field.
· Proven experience in PySpark development and a deep understanding of Spark’s RDD API.
· Expertise in data partitioning, parallel programming, and handling large datasets in a distributed environment.
· Strong knowledge of Spark’s DAG execution model, lazy evaluation, and optimizations (see the sketch following this list).
· Experience with Hadoop, HDFS, and other distributed storage systems.
· Knowledge of SQL and experience working with structured and unstructured data.
· Familiarity with cloud platforms (AWS, Azure, GCP) and distributed storage solutions is a plus.
· Strong problem-solving skills and the ability to optimize data flows for performance and scalability.
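
To illustrate the point above on DAG execution and lazy evaluation, here is a minimal, hypothetical sketch: transformations only record lineage in the DAG, and no computation happens until an action is invoked. The data and partition count are arbitrary.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-sketch").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize(range(1_000_000), 16)  # 16 partitions, chosen arbitrarily

# These lines only record lineage in the DAG; nothing executes yet.
squared = nums.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)

# The action triggers the whole pipeline in one pass; the recorded lineage is
# also what lets Spark recompute lost partitions for fault tolerance.
print(evens.count())
spark.stop()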
Preferred Skills:
· Experience with Spark Structured Streaming for real-time data processing (a minimal example follows this list).
· Proficiency in Python or Scala for data manipulation and pipeline development.
· Knowledge of workflow orchestration tools such as Apache Airflow, or container platforms such as Kubernetes.
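
As a pointer to the Structured Streaming item above, the following is a minimal word-count sketch; the socket source, host, and port are assumptions chosen for demonstration rather than anything specific to this role.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Read a stream of text lines from a socket (source chosen for simplicity).
lines = (spark.readStream.format("socket")
              .option("host", "localhost").option("port", 9999).load())

# Split lines into words and maintain a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the full updated result table to the console after each micro-batch.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()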
Benefits:
· Opportunities for professional growth and learning in the field of big data.
For interested candidates:
Kindly send your resume to: [email protected], or click ‘Apply Now’.
We regret that only shortlisted candidates will be contacted.
Dianne Balmaceda Antonio
R1105287
BGC Group Pte Ltd
EA 05C3053