Blending Nodejs and R projects

Posted: 24 December, 2018 Category: code Tagged: R, RStudio, nodejs

After getting through a version of the ubiquitous R text-mining tutorial, I wondered if I could kick the whole thing off, programmatically, from nodejs. I could. This is how:

A botched attempt with the 'r-script' package

The outdated r-script wasn't working for me. I started doing something I'd never done before: editing an npm package inline, right inside the node_modules folder. Gasp!! I tried:

  • excising the callSync method from its object prototype
  • changing the launch script to launch R directly instead of its own internal launcher
  • promisifying it

... and then when I got to the point where I was questioning its dependence on the needs package, I realised wtf, this thing is actually a tad overwrought for my needs. I literally just need to invoke R, and node has child processes out of the box. So:

Invoking directly from Node

This is all I needed, in the end:

const path = require('path');
const spawn = require("child_process").spawn;
require('dotenv').config();

// this is the path to the R text-mining script. It's more or less
// hardcoded here but could be a cmdline arg in a more elaborate setup
const rscriptPath = path.resolve("src", "sourcerer-core", "wordcloud.R");

const callR = (scriptPath) => {
  return new Promise((resolve, reject) => {
    let err = false;
    const child = spawn(process.env.RSCRIPT, ["--vanilla", scriptPath, "--args", process.env.RBASEDIR]);
    child.stderr.on("data", (data) => {
      console.log(data.toString());
    });
    child.stdout.on("data", (data) => {
      console.log(data.toString());
    });
    child.on('error', (error) => {
      err=true;
      reject(error);
    });
    child.on('exit', (code) => {
      if (err) return; // the error event above has already rejected
      if (code === 0) {
        resolve("done.");
      } else {
        reject(new Error(`R script exited with code ${code}`));
      }
    });
  });
}

console.log("Invoking R script... at:", rscriptPath);
callR(rscriptPath)
.then(result => {
  console.log("finished with result:", result);
})
.catch(error => {
  console.log("Finished with error:", error);
});

Yep. That's all. No need to rely on other packages. I just had to find where the R executable lived on my machine. Turns out it's called Rscript.exe (I'm on Windows). RStudio had long abstracted me from this fact. :)
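
Those two environment variables come from a .env file that dotenv loads at the top of the script. A .env for this setup would look something like this; the values are purely illustrative (point RSCRIPT at wherever Rscript.exe lives on your machine), and I'm using RBASEDIR as the project root the R script should work against:

# .env (illustrative values only)
RSCRIPT=C:\Program Files\R\R-3.5.1\bin\Rscript.exe
RBASEDIR=C:\projects\my-wordcloud-project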

The --vanilla flag, as I understand it, gives you a brand new, clean R session with nothing loaded... this is the safest option for reproducibility in any environment. The --args flag tells R to stop interpreting the command line at that point and pass everything after it into your R script as arguments. So in the actual R script, I have this:

# handle commandline args...
# rem: with trailingOnly = TRUE below, commandArgs() returns only the
# values supplied after "--args", so args[1] is the first real argument.
args <- commandArgs(TRUE)
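
From there, the script can grab the base directory that node passed along. The variable name and the setwd() call here are just my illustration of the idea, not necessarily verbatim from wordcloud.R:

base_dir <- args[1]  # the RBASEDIR value handed over from node
setwd(base_dir)      # resolve everything else relative to the project root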

How do the two codebases live side by side, and which editor is dominant?

Well, so far I've resisted the urge to do my R editing in vscode. All the JS is edited in vscode, and believe it or not, all the R is edited in RStudio, which is free and already installed on my machine. RStudio is best for editing and playing with R scripts, so I figure, let it do its job.

One key to peaceful co-existence is actually the /data folder, which has inputs, staging and output folders. It is infinitely easier to use the filesystem for data handshakes than trying to squirrel JSON between two processes like the r-script package was trying to do. Infinitely more debuggable and scalable too, as far as I'm concerned.
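
To make the handshake concrete: node drops inputs into /data, R writes its intermediate and final artefacts back into /data, and neither process ever has to talk to the other directly. A rough sketch of the R end, with made-up object and file names:

# drop intermediate results where the node side (or a human) can find them
write.csv(tokens, file.path(base_dir, "data", "staging", "tokens.csv"),
          row.names = FALSE)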

Another key thing is that the R script must be prepared to install packages that don't already exist in the environment, instead of barfing and shuddering to a halt. This is the workaround I'm using (I found it online, tried it, and it works):

# utility for checking if a package is already installed (installing it if not), then loading it.
usePackage <- function(p) {
  if (!is.element(p, installed.packages()[,1])) {
    install.packages(p, dep = TRUE)
  }
  require(p, character.only = TRUE)
}

Hereafter, all your cutesy library() calls need to become usePackage() calls.
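
So the top of the script ends up looking something like this; the exact list is whatever your script pulls in, these are just the usual text-mining suspects:

usePackage("tm")           # text mining framework
usePackage("SnowballC")    # word stemming
usePackage("wordcloud")    # drawing the cloud itself
usePackage("RColorBrewer") # colour palettes for the cloud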

The last piece of the puzzle is git. When it comes to interfacing with git, I let vscode run the show because I prefer its git interface. To be fair, RStudio is git-aware too, and you can also access your command line from RStudio. I just felt happier doing this from vscode.

My source tree ended up looking like this: [image: source tree]

The core R project can load all by itself in RStudio, regardless of the surrounding project. I basically just tucked it away one folder down inside the source tree. I may evolve how I do this going forward: this was a first stab.
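
Roughly, the shape is something like this (the root and entry-point names below are placeholders; the important bits are the nested R project and the shared /data folder):

my-project/
  index.js           <- the node entry point shown above
  .env
  data/
    inputs/
    staging/
    output/
  src/
    sourcerer-core/   <- the self-contained R project
      wordcloud.R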

The actual text-mining stuff

I followed this AWESOME tutorial. If you're new to R, you will love it. For my corpus I just dug up the Act of Canada. I used the tutorial to mine tokens and clean the text data to build the wordcloud. Some things I learned along the way:

  1. Learned what a Document Term Matrix is (see the sketch below): [image: document term matrix]

  2. Throughout the text-mining steps, inspecting the data shows you lots of different things... namely:

    • transformations can often reshape your data structures: rows become columns and vice versa... depending on how the underlying utility works.
    • each line in the original file was actually being treated as a document in its own right within the corpus
  3. Programmatically outputting charts and plots in R is a bit finicky, and you have to deal with "devices" that handle different media types. I figured it out after a bit of googling (see the sketch below), but I still don't know why the raw output as seen directly in RStudio is much worse than the explicit pdf device output. shrug. Needless to say, I stuck with the pdf output: [image: pdf output]
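
Two quick sketches to pin down points 1 and 3. Neither is lifted verbatim from wordcloud.R; docs is the corpus and d is the word-frequency data frame the tutorial has you build along the way. First, building and peeking at a document-term matrix:

dtm <- DocumentTermMatrix(docs)  # rows = documents, columns = terms
inspect(dtm[1:5, 1:5])           # peek at one corner of the matrix

And the device dance for getting the plot out as a pdf:

# open an explicit pdf device, draw into it, then close it so the file gets written
pdf(file.path(base_dir, "data", "output", "wordcloud.pdf"), width = 8, height = 8)
wordcloud(words = d$word, freq = d$freq, min.freq = 2,
          max.words = 100, random.order = FALSE)
dev.off()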

You can play with the codebase if you like.