An approach to in-place data transformation in mongodb

Posted: 04 February, 2019 Category: code Tagged: data-wranglingmongodbsoftware engineering

Recently I had to write a script to upgrade one of our database schemas while also making updates to the actual data in the database.

Approach

1 - Iterate over the collection

Basically, I went with cursors, because 1) I didn't want to use toArray() and hold everything in memory locally in a giant array, and 2) at the moment I wasn't dealing with anything fancy like shards and replica sets... in other words, no unstable master potential. So:

high level process

In my actual implementation I went with a method that encapsulates all the messy little data transformation steps. So my loop was one level down: I grab the cursor and basically hand it over to my migration function to handle iterating over the documents, and wait for it to finish. Makes the high-level workflow more readable. But to simplify this blog post, let's forget all that and focus on the fact that for each document, some magical perform data transforms action occurs.

2 - Trace business rules and reqirements to specific data transforms

During the migration there were a bunch of very small pieces of business logic that needed to be executed. Most of these were transition requirements, but not all. Regardless of the type of requirement they represented, I realized that each piece of business logic implied a tiny change to the data. That gave me the idea to make this mapping more explicit, and just have each rule/requirement spit out its own small fragment of data that was relevant to it and only it:

const transformDocument = (doc) => {
  // updateDelta 1:= determine delta based on business rule / requirement #1
  // updateDelta 2:= determine delta based on business rule / requirement #2
  // updateDelta 3:= determine delta based on business rule / requirement #3
  ...
  // updateDelta N:= determine delta based on business rule / requirement #N
}

Tracing business rules to transforms

Aside: It bears mentioning that in real life, precedence may well matter and you should pay attention to ordering. To migitate interference between the business rules, I extracted any/all deleterious actions to a final stage in my migration pipeline. In other words, always break down requirements such that additive transforms go first, followed by deleterious transforms ideally at the very end).

3 - Reduce the necessary transforms into a single aggregate transform

At the end, a straightforward Object.Assign() allows you to aggregate all the changes needed to update or migrate the current document to the desired new state.

reducing the deltas

Aside: In my case, since I had delegated my cursor iteration to a function that would handle calling the actual transformer, I had an array of promises to wait on. Promise.all to the rescue! The resulting array is easily .reduce()'d. I admit it seems silly to have gone to the trouble of using cursors and then still having an array which would, worst case, be the same length as the original query resultset. I console myself with the fact that this would still be a considerable memory saving because:

  • not every document will pass the criteria of each business rule in the transformer, so the length of the resulting array will be a lot less
  • we are not storing every field in each document, since we are concerned with small additive changes.

Results

So far this has worked well and has proven to be a fairly robust approach to safely changing data in-place. I think it is good for small patches to data. Secondly, separating the business logic from the database management logic is a clean separation of concerns and allows the utilities at the database layer to be reusable for other data-wrangling tasks.

Tips

  • Always backup first
  • Always pay attention to the interaction between business rules and the transforms they will incur
  • Save deleterious actions till last
  • Extra safety points: "Abort Mission" on any transforms whenever anything smells wrong: if your code iterates to a doc that seems iffy, leave it the hell alone and just truck on to the next thing you're sure about modifying. A human can always wade in later.

Other things to be cognizant of: mongoshell is not the nodejs mongoclient, which in turn is most certainl NOT mongoose,the ORM whose inescapable gravity well you've probably fallen into. (So much for schemaless databases).