Data processing experiment - Part 10
The one where I refactor to clean up and use more polymorphic serialization to simplify and reduce code
The code for this project is available on GitHub. I'm using a branch for each part and merging each part into the latest branch. See the ReadMe.md in each branch for the story.
In its current state the framework can:
load tables from configuration
trim
handle multiple formats
handle multiple column names
select only configured columns
convert to types
remove invalid rows
deduplicate
generate statistics
apply a pipeline of tasks
join
union
map values
add literal columns
write output
The framework is extensible so more types, statistics and tasks can easily be added as future requirements evolve.
The configuration for the reference application can be seen here:
Here's the output from the reference application for each stage for comparison.
This week I haven't been able to add any new features, but I have done some cleanup and refactoring. Since discovering how to use Kotlin polymorphic serialization for the pipeline work, I now appreciate how much it can simplify the codebase. Accordingly, I've modified the table configuration to use it, so the configuration directly instantiates the type classes instead of creating a generic type definition which then has to be transformed into the type. This reduces both code and complexity.
Column configurations now have a type property which refers to a concrete class:
{
names: ["amount"],
alias: "amount",
description: "amount can be a positive (credit) or negative (debit)",
type: {
type: "com.example.dataprocessingexperiment.spark.data.types.DecimalType",
precision: 10,
scale: 2
},
required: true
}
Now the type classes can have specific fields:
DecimalType has precision and scale
DateType has formats
This is much better than before, when a single generic column class covered all types and one formats string list had to carry the parameters for every type.
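To make this concrete, here is a minimal sketch of how a column definition like the one above could be decoded with kotlinx.serialization. This is not the project's actual code: the ColumnType interface, the ColumnDefinition class, the parseColumn helper and the use of strict JSON with quoted keys are all assumptions for illustration. The @SerialName values simply mirror the class names used in the configuration, and the inner "type" key is relied on as the library's default class discriminator.

```kotlin
import kotlinx.serialization.SerialName
import kotlinx.serialization.Serializable
import kotlinx.serialization.json.Json
import kotlinx.serialization.modules.SerializersModule
import kotlinx.serialization.modules.polymorphic
import kotlinx.serialization.modules.subclass

// Hypothetical base type; the real project's hierarchy may look different.
interface ColumnType

// Each concrete type declares only the fields it actually needs.
@Serializable
@SerialName("com.example.dataprocessingexperiment.spark.data.types.DecimalType")
data class DecimalType(val precision: Int, val scale: Int) : ColumnType

@Serializable
@SerialName("com.example.dataprocessingexperiment.spark.data.types.DateType")
data class DateType(val formats: List<String>) : ColumnType

// Hypothetical column definition mirroring the configuration shown earlier.
@Serializable
data class ColumnDefinition(
    val names: List<String>,
    val alias: String,
    val description: String,
    val type: ColumnType, // interface-typed property, decoded polymorphically
    val required: Boolean = false
)

// Register the subclasses so the "type" discriminator can be resolved.
val json = Json {
    serializersModule = SerializersModule {
        polymorphic(ColumnType::class) {
            subclass(DecimalType::class)
            subclass(DateType::class)
        }
    }
}

// Hypothetical helper: decode one column definition from JSON text.
fun parseColumn(text: String): ColumnDefinition =
    json.decodeFromString(ColumnDefinition.serializer(), text)

// Strict JSON version of the column configuration (keys quoted by assumption).
val sampleColumn = """
    {
      "names": ["amount"],
      "alias": "amount",
      "description": "amount can be a positive (credit) or negative (debit)",
      "type": {
        "type": "com.example.dataprocessingexperiment.spark.data.types.DecimalType",
        "precision": 10,
        "scale": 2
      },
      "required": true
    }
""".trimIndent()

fun main() {
    val column = parseColumn(sampleColumn)
    // The nested "type" object is instantiated directly as a DecimalType.
    println(column.type) // prints DecimalType(precision=10, scale=2)
}
```

With this approach, supporting a new type means adding a class and registering it; no intermediate generic definition has to be mapped by hand. If all the types live in one module, a sealed interface would remove the need for the explicit registration.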
Next week I'm going to experiment with Notebooks to implement similar functionality...
Some options I hope to look into over the next couple of weeks are: