Data processing experiment - Part 10

The one where I refactor to clean up and use more polymorphic serialization to simplify and reduce code

Apr 06, 2024

The code for this project is available on GitHub. I'm using a branch for each part and merging each part into the latest branch. See the ReadMe.md in each branch for the story.
GitHub repository for this project
Pull requests for each part
Branch for part-10

In its current state the framework can:

load tables from configuration
- trim
- handle multiple formats
- handle multiple column names
select only configured columns
convert to types
remove invalid rows
deduplicate
generate statistics
apply a pipeline of tasks
- join
- union
- map values
- add literal columns
- write output

The framework is extensible so more types, statistics and tasks can easily be added as future requirements evolve.

The configuration for the reference application can be seen here:

Here's the output from the reference application for each stage for comparison.

This week I haven't been able to add any new features, but I have done some clean-up and refactoring. Since discovering how to use Kotlin polymorphic serialization for the pipeline work, I've come to appreciate how much it can simplify the codebase. Accordingly, I've modified the table configuration to use it, so the configuration now instantiates the type classes directly instead of creating a generic type definition that then has to be transformed into the type. That reduces both code and complexity.

Column configurations now have a type property which refers to a concrete class:

{
  names: ["amount"],
  alias: "amount",
  description: "amount can be a positive (credit) or negative (debit)",
  type: {
    type: "com.example.dataprocessingexperiment.spark.data.types.DecimalType",
    precision: 10,
    scale: 2
  },
  required: true
}

Now the type classes can have specific fields:

DecimalType has precision and scale
DateType has formats
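To make the idea concrete, here is a minimal sketch of how a column definition like the one above could be decoded with kotlinx.serialization's polymorphic support. It is not the project's actual code: ColumnType, ColumnDefinition and columnJson are hypothetical names, the DateType package is assumed to match DecimalType's, and the sample uses strict JSON with quoted keys rather than the relaxed syntax shown in the post. It assumes the kotlinx-serialization-json library and the serialization compiler plugin are on the build.

```kotlin
import kotlinx.serialization.SerialName
import kotlinx.serialization.Serializable
import kotlinx.serialization.json.Json
import kotlinx.serialization.modules.SerializersModule
import kotlinx.serialization.modules.polymorphic
import kotlinx.serialization.modules.subclass

// Hypothetical base type; interfaces are treated as polymorphic by default.
interface ColumnType

// The default serial name is the fully qualified class name, so in a real
// package no @SerialName is needed. It is spelled out here only because this
// sketch has no package declaration.
@Serializable
@SerialName("com.example.dataprocessingexperiment.spark.data.types.DecimalType")
data class DecimalType(val precision: Int, val scale: Int) : ColumnType

// Assumed to live in the same package as DecimalType.
@Serializable
@SerialName("com.example.dataprocessingexperiment.spark.data.types.DateType")
data class DateType(val formats: List<String>) : ColumnType

// Hypothetical column definition mirroring the fields in the configuration.
@Serializable
data class ColumnDefinition(
    val names: List<String>,
    val alias: String,
    val description: String,
    val type: ColumnType,
    val required: Boolean = false,
)

// Register the concrete subclasses so the "type" discriminator can be resolved.
val columnJson = Json {
    serializersModule = SerializersModule {
        polymorphic(ColumnType::class) {
            subclass(DecimalType::class)
            subclass(DateType::class)
        }
    }
}

fun main() {
    val text = """
        {
          "names": ["amount"],
          "alias": "amount",
          "description": "amount can be a positive (credit) or negative (debit)",
          "type": {
            "type": "com.example.dataprocessingexperiment.spark.data.types.DecimalType",
            "precision": 10,
            "scale": 2
          },
          "required": true
        }
    """.trimIndent()

    // The nested "type" object is decoded straight into a DecimalType instance.
    val column = columnJson.decodeFromString(ColumnDefinition.serializer(), text)
    check(column.type == DecimalType(precision = 10, scale = 2))
    println(column)
}
```

The inner "type" value acts as the class discriminator, which is why it can sit alongside precision and scale without a separate mapping step.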

This is much better than before, when a single generic column class covered all types and a single formats string list carried every type's parameters.
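For contrast, the older approach might have looked roughly like the sketch below. This is a guess at the shape rather than the code that was removed: GenericColumn, toColumnType and the "decimal" / "date" type keys are all hypothetical. It shows the kind of hand-written transformation step that is no longer needed once the configuration names a concrete class.

```kotlin
// Hypothetical "before" shape: one generic definition for every type,
// with type-specific parameters squeezed into a list of strings.
data class GenericColumn(
    val name: String,
    val type: String,
    val formats: List<String> = emptyList(),
)

sealed interface ColumnType
data class DecimalType(val precision: Int, val scale: Int) : ColumnType
data class DateType(val formats: List<String>) : ColumnType

// The transformation step: each type has to unpack its own parameters
// from the shared string list, by position and by parsing.
fun toColumnType(column: GenericColumn): ColumnType = when (column.type) {
    "decimal" -> {
        require(column.formats.size == 2) { "decimal needs precision and scale" }
        DecimalType(column.formats[0].toInt(), column.formats[1].toInt())
    }
    "date" -> DateType(column.formats)
    else -> throw IllegalArgumentException("Unknown type: ${column.type}")
}

fun main() {
    val amount = GenericColumn("amount", "decimal", listOf("10", "2"))
    println(toColumnType(amount))
}
```

Every new type would add another branch and another convention for what the strings mean, which is the overhead the typed fields avoid.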

Next week I'm going to experiment with Notebooks to implement similar functionality...

Some options I hope to look into over the next couple of weeks are:

Paul’s Software Substack
