Job Description
Role: Data Tester
Location: Louisville, KY (Remote)
Type: Contract
Job Summary:
- We are seeking an experienced Data Tester with strong expertise in Databricks, PySpark, and Big Data ecosystems. The ideal candidate will have a solid background in testing data pipelines, ETL workflows, and analytical data models, ensuring data integrity, accuracy, and performance across large-scale distributed systems.
- This role requires hands-on experience with Databricks, Spark-based data processing, and strong SQL validation skills, along with familiarity with data lake / Delta Lake testing, automation, and cloud environments (AWS, Azure, or GCP).
Key Responsibilities:
- Validate end-to-end data pipelines developed in Databricks and PySpark, including data ingestion, transformation, and loading processes.
- Develop and execute test plans, test cases, and automated scripts for validating ETL jobs and data quality across multiple stages.
- Conduct data validation, reconciliation, and regression testing using SQL, Python, and PySpark DataFrame APIs.
- Verify data transformations, aggregations, and schema consistency across raw, curated, and presentation layers.
- Test Delta Lake tables for schema evolution, partitioning, versioning, and performance.
- Collaborate with data engineers, analysts, and DevOps teams to ensure high-quality data delivery across the environment.
- Analyze Databricks job logs, Spark execution plans, and cluster metrics to identify and troubleshoot issues.
- Automate repetitive test scenarios and validations using Python / PySpark frameworks (a brief illustrative sketch follows this list).
- Participate in Agile/Scrum ceremonies, contributing to sprint planning, estimations, and defect triage.
- Maintain clear documentation for test scenarios, execution reports, and data lineage verification.
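For illustration only, the sketch below shows the kind of PySpark / pytest validation work described in the responsibilities above: row-count reconciliation, schema consistency, and duplicate-key checks. It is a minimal sketch under assumed inputs; the local SparkSession, the sample data, and the column names (order_id, order_date, amount) are placeholders for demonstration, not details of the actual pipelines or tables.

    # Illustrative sketch only: pytest + PySpark checks of the type described above.
    # Table contents and column names are assumptions, not project specifics.
    import pytest
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F


    @pytest.fixture(scope="session")
    def spark():
        # Local session for the sketch; a real suite would run against a Databricks cluster.
        spark = SparkSession.builder.master("local[2]").appName("data-validation").getOrCreate()
        yield spark
        spark.stop()


    @pytest.fixture
    def source_df(spark):
        # Hypothetical raw-layer data standing in for an ingested table.
        return spark.createDataFrame(
            [(1, "2024-01-01", 100.0), (2, "2024-01-01", 250.5)],
            ["order_id", "order_date", "amount"],
        )


    @pytest.fixture
    def target_df(spark):
        # Hypothetical curated-layer data standing in for the transformed output.
        return spark.createDataFrame(
            [(1, "2024-01-01", 100.0), (2, "2024-01-01", 250.5)],
            ["order_id", "order_date", "amount"],
        )


    def test_row_count_reconciliation(source_df, target_df):
        # Counts should match end to end when no records are filtered or duplicated.
        assert source_df.count() == target_df.count()


    def test_schema_consistency(source_df, target_df):
        # Column names and types should be identical across layers.
        assert source_df.schema == target_df.schema


    def test_no_duplicate_keys(target_df):
        # The curated layer should contain one row per business key.
        duplicates = (
            target_df.groupBy("order_id")
            .agg(F.count("*").alias("cnt"))
            .filter(F.col("cnt") > 1)
        )
        assert duplicates.count() == 0

In practice, comparable checks would typically read from Databricks tables or Delta Lake paths rather than in-memory DataFrames.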
Required Qualifications:
- 8+ years of overall experience in data testing / QA within large-scale enterprise data environments.
- 5+ years of experience in testing ETL / Big Data pipelines, validating data transformations, and ensuring data integrity.
- 4+ years of hands-on experience with Databricks, including notebook execution, job scheduling, and workspace management.
- 4+ years of experience in PySpark (DataFrame APIs, UDFs, transformations, joins, and data validation logic).
- 5+ years of strong SQL proficiency (joins, aggregations, window functions, and analytical queries) for validating complex datasets.
- 3+ years of experience with Delta Lake or data lake testing (schema evolution, ACID transactions, time travel, partition validation).
- 3+ years of experience in Python scripting for automation and data validation tasks.
- 3+ years of experience with cloud-based data platforms (Azure Data Lake, AWS S3, or GCP BigQuery).
- 2+ years of experience in test automation for data pipelines using tools like pytest, PySpark test frameworks, or custom Python utilities.
- 4+ years of experience with data warehousing concepts, data modeling (Star and Snowflake schemas), and data quality frameworks.
- 4+ years of experience with Agile / SAFe methodologies, including story-based QA and sprint deliverables.
- 6+ years of experience applying analytical and debugging skills to identify data mismatches, performance issues, and pipeline failures.
Preferred Qualifications:
- Experience with CI/CD for Databricks or data testing (GitHub Actions, Jenkins, Azure DevOps).
- Exposure to BI validation (Power BI, Tableau, Looker) for verifying downstream reports.
- Knowledge of REST APIs for metadata validation or system integration testing.
- Familiarity with big data tools like Hive, Spark SQL, Snowflake, and Airflow.
- Cloud certifications (e.g., Microsoft Azure Data Engineer Associate or AWS Big Data Specialty) are a plus.
Job Tags
Contract work, Remote work