Recently I had to write a script to upgrade one of our database schemas while also making updates to the actual data in the database.
Basically, I went with cursors, because 1) I didn't want to use toArray() and hold everything in memory locally in a giant array, and 2) at the moment I wasn't dealing with anything fancy like shards and replica sets... in other words, no unstable-master potential. So:
In my actual implementation I went with a method that encapsulates all the messy little data transformation steps. So my loop was one level down: I grab the cursor and basically hand it over to my migration function to handle iterating over the documents, and wait for it to finish. Makes the high-level workflow more readable. But to simplify this blog post, let's forget all that and focus on the fact that for each document, some magical perform data transforms action occurs.
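Stripped down, that hand-off looks something like the sketch below. The names (`fakeCursor`, `migrateDocuments`) are mine, and a plain async generator stands in for a real driver cursor, since both support `for await...of`:

```javascript
// A plain async generator standing in for a real MongoDB cursor here,
// since both support `for await...of` iteration.
async function* fakeCursor(docs) {
  for (const doc of docs) yield doc;
}

// Hypothetical migration function: the high-level workflow hands it the
// cursor, and it iterates one document at a time, so only the current
// document (plus its transformed result) is held in memory.
async function migrateDocuments(cursor, transform) {
  const results = [];
  for await (const doc of cursor) {
    results.push(transform(doc)); // the "perform data transforms" step
  }
  return results;
}

// Usage: grab the "cursor", hand it over, and wait for it to finish.
migrateDocuments(fakeCursor([{ _id: 1 }, { _id: 2 }]), (doc) => ({ ...doc, migrated: true }))
  .then((docs) => console.log(docs.length)); // prints 2
```

With a real driver you'd swap `fakeCursor(...)` for `collection.find(query)`; the migration function wouldn't change.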
During the migration there were a bunch of very small pieces of business logic that needed to be executed. Most of these were transition requirements, but not all. Regardless of the type of requirement they represented, I realized that each piece of business logic implied a tiny change to the data. That gave me the idea to make this mapping more explicit, and just have each rule/requirement spit out its own small fragment of data that was relevant to it and only it:
const transformDocument = (doc) => {
  // updateDelta1 := delta from business rule / requirement #1
  // updateDelta2 := delta from business rule / requirement #2
  // updateDelta3 := delta from business rule / requirement #3
  // ...
  // updateDeltaN := delta from business rule / requirement #N
};
Aside: It bears mentioning that in real life, precedence may well matter and you should pay attention to ordering. To mitigate interference between the business rules, I extracted any and all deleterious actions to a final stage in my migration pipeline. In other words, always break down requirements such that additive transforms go first, followed by deleterious transforms (ideally at the very end).
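To make the ordering point concrete, here's a toy sketch (both rules are invented for illustration): the additive stage runs while every original field is still readable, and field removal happens only at the very end.

```javascript
// Hypothetical rules, invented for illustration. Additive rules return
// fields to add; deleterious rules return field names to remove.
const addFullName = (doc) => ({ fullName: `${doc.first} ${doc.last}` });
const dropLegacyNameFields = () => ['first', 'last'];

const migrateInStages = (doc) => {
  // Stage 1: additive transforms go first, while every original field
  // is still readable.
  const updated = Object.assign({}, doc, addFullName(doc));
  // Stage 2: deleterious transforms run at the very end, so they can't
  // starve an additive rule of a field it needed to read.
  for (const field of dropLegacyNameFields()) delete updated[field];
  return updated;
};
```

If `dropLegacyNameFields` ran first, `addFullName` would have nothing to read; running it last keeps the rules from interfering with each other.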
At the end, a straightforward Object.assign() allows you to aggregate all the changes needed to update or migrate the current document to the desired new state.
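As a concrete sketch (the two rules here are invented for illustration), each rule returns only its own fragment, and Object.assign folds them together:

```javascript
// Two invented business rules; each returns only the small fragment of
// data that is relevant to it and only it.
const ruleActiveStatus = (doc) => ({ status: doc.legacyStatus === 1 ? 'active' : 'inactive' });
const ruleStampVersion = () => ({ schemaVersion: 2 });

const transformDocument = (doc) => {
  const delta1 = ruleActiveStatus(doc);
  const delta2 = ruleStampVersion();
  // Fold the deltas onto a copy of the original document. Later deltas
  // win on conflicting keys, which is why ordering can matter.
  return Object.assign({}, doc, delta1, delta2);
};
```

Passing `{}` as the first argument keeps the original document unmutated, which is handy if anything else still holds a reference to it.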
Aside: In my case, since I had delegated my cursor iteration to a function that would handle calling the actual transformer, I had an array of promises to wait on. Promise.all to the rescue! The resulting array is easily .reduce()'d. I admit it seems silly to have gone to the trouble of using cursors and then still end up with an array which would, worst case, be the same length as the original query result set. I console myself with the fact that this is still a considerable memory saving, because the array holds only the small deltas rather than the full documents.
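Sketched with plain promises (no driver involved), the wait-then-fold pattern looks like this:

```javascript
// Pretend each element is the promise returned by transforming one
// document; here each resolves to a small delta object.
const pendingDeltas = [
  Promise.resolve({ a: 1 }),
  Promise.resolve({ b: 2 }),
  Promise.resolve({ c: 3 }),
];

// Wait on all of them, then fold the resulting array into one object.
const mergedPromise = Promise.all(pendingDeltas).then((deltas) =>
  deltas.reduce((acc, delta) => Object.assign(acc, delta), {})
);

mergedPromise.then((merged) => console.log(merged)); // { a: 1, b: 2, c: 3 }
```

Note that Promise.all rejects as soon as any one promise rejects, so a migration built this way should decide up front whether one bad document aborts the whole run.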
So far this has worked well and has proven to be a fairly robust approach to safely changing data in place. First, I think it is good for small patches to data. Second, separating the business logic from the database-management logic is a clean separation of concerns and allows the utilities at the database layer to be reused for other data-wrangling tasks.
Other things to be cognizant of: the mongo shell is not the Node.js MongoDB client, which in turn is most certainly NOT Mongoose, the ODM whose inescapable gravity well you've probably fallen into. (So much for schemaless databases.)