Friday, November 3, 2017

Making a case for NoSQL

Yesterday at work, some colleagues and I had a discussion on databases (riveting, I’m sure, for all the cubicle dwellers around us), and somehow the topic of NoSQL came up.   I tried to explain NoSQL and to make a case for it, but in the heat of the moment, I had some trouble conceptualizing scenarios in which NoSQL made sense.   I’m a tad introspective, so when I left work that afternoon the thoughts of NoSQL continued to bubble and gurgle in my mind the whole drive home.

In an effort to practice writing a bit more, I thought it would be a good exercise to dump my meanderings (or rather, gracefully pour those thoughts out) into a refreshing pool of insight. Here goes…

My database background is primarily relational (Relational Database – RDB). I’m mostly self-taught when it comes to computer skills, and I remember disagreeing with my manager early in my career over whether or not it would be good to normalize a one-to-many relationship. I was naïve, and I thought it would be easier to just add 10 or so child fields to our primary Access (!) database table. It wasn’t immediate, but that discussion was part of the epiphany that opened my eyes to relational design.

There is a certain beauty to relational design: 
  • It reduces data footprints (at least if done right) – If we think of a library-type database, it’s a lot more efficient (disk-space-wise) to store an Author ID integer for each of our 3,000,000 books than it would be to store 3,000,000 separate ‘Author First Name’, ‘Author Last Name’, etc. string-based fields.
  • It improves data integrity – In that same database, it would be much easier to keep up with 1 ‘Charles Dickens’ author than it would be to pick out all the various iterations of ‘C. Dickens’, ‘Charle Dickings’, etc. that people might have entered as authors for the various books.
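To make those two points concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and book titles are my own assumptions for illustration, not a real library schema. The author's name lives in exactly one row, so a misspelling gets fixed once and every book picks up the correction.

```python
import sqlite3

# In-memory database; the schema is a hypothetical library example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    PRAGMA foreign_keys = ON;
    CREATE TABLE authors (
        author_id  INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL
    );
    CREATE TABLE books (
        book_id   INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES authors(author_id)
    );
""")

# Someone typed the author's name wrong, but only once.
conn.execute("INSERT INTO authors VALUES (1, 'Charle', 'Dickings')")
conn.executemany(
    "INSERT INTO books (title, author_id) VALUES (?, ?)",
    [("Oliver Twist", 1), ("Bleak House", 1), ("Great Expectations", 1)],
)

# One UPDATE repairs the name for every book that points at author 1.
conn.execute(
    "UPDATE authors SET first_name = 'Charles', last_name = 'Dickens' "
    "WHERE author_id = 1"
)

# Every book now resolves to the same, single spelling.
names_on_books = conn.execute("""
    SELECT DISTINCT a.first_name || ' ' || a.last_name
    FROM books b
    JOIN authors a ON a.author_id = b.author_id
""").fetchall()
print(names_on_books)
```

Each book row carries only a small integer instead of repeated name strings, which is where the footprint savings come from.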

That doesn’t come without a cost, though:
  • Reassembling normalized data requires work – Well-designed schemas are elegant and efficient, but it does take a little effort (for man and machine) to join all those pieces back together. Database servers are very good at that (it’s almost as though they were designed specifically for the purpose of handling data), but it’s not always a trivial thing even for them…and even trivial things take their toll when you’re being asked to do them in bulk.

So let’s branch out a bit. Libraries are great and all (unless you ask Ron Swanson), but video games are more my speed, so I’m going to imagine a shooter game.

There are probably players in the game, so an RDB would likely need a [Player] table. There’s also going to be a collection of available weapons, so we probably need a [Guns] table.

Players will have guns, so we would want a many-to-many relational table for that ([PlayerGuns]).

Guns also have ammo, but to make things interesting, there may be different types of ammo for each gun (hollow-point, slug vs. pellet shells, etc.), so we also need some tables to handle that ([Ammo], [PlayerGunAmmo]).

Maybe players can also customize their guns. To keep it simple, the [PlayerGuns] table could just hold a reference to our [GunSkins] table, but stickers are also cool, so maybe each player-gun can have multiple stickers. So, we also need [Stickers] and [PlayerGunStickers].

Guns are only part of the equation, though, so our players also need some [Gear] (& [PlayerGear]). 

Our database design is starting to get fairly complicated now, but, again, this is what Database Servers are good at…
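Here is a rough sketch of part of that design, again with Python's sqlite3 module. The column names and sample rows are assumptions of mine, and I've left out the ammo and gear tables to keep it short. Notice how many joins it takes just to answer "what does this player's gun look like?"

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    PRAGMA foreign_keys = ON;
    CREATE TABLE Player   (player_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE Guns     (gun_id     INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE GunSkins (skin_id    INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE Stickers (sticker_id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Many-to-many: which player owns which gun, and with which skin.
    CREATE TABLE PlayerGuns (
        player_gun_id INTEGER PRIMARY KEY,
        player_id     INTEGER NOT NULL REFERENCES Player(player_id),
        gun_id        INTEGER NOT NULL REFERENCES Guns(gun_id),
        skin_id       INTEGER REFERENCES GunSkins(skin_id)
    );
    -- Many-to-many: each player-gun can carry several stickers.
    CREATE TABLE PlayerGunStickers (
        player_gun_id INTEGER NOT NULL REFERENCES PlayerGuns(player_gun_id),
        sticker_id    INTEGER NOT NULL REFERENCES Stickers(sticker_id),
        PRIMARY KEY (player_gun_id, sticker_id)
    );

    -- Hypothetical sample data.
    INSERT INTO Player   VALUES (1, 'ron');
    INSERT INTO Guns     VALUES (1, 'shotgun');
    INSERT INTO GunSkins VALUES (1, 'camo');
    INSERT INTO Stickers VALUES (1, 'smiley face'), (2, 'skull');
    INSERT INTO PlayerGuns        VALUES (1, 1, 1, 1);
    INSERT INTO PlayerGunStickers VALUES (1, 1), (1, 2);
""")

# Five joins to rebuild one player's loadout, one row per sticker.
loadout = conn.execute("""
    SELECT p.name, g.name, s.name, st.name
    FROM Player p
    JOIN PlayerGuns pg ON pg.player_id = p.player_id
    JOIN Guns g ON g.gun_id = pg.gun_id
    LEFT JOIN GunSkins s ON s.skin_id = pg.skin_id
    LEFT JOIN PlayerGunStickers pgs ON pgs.player_gun_id = pg.player_gun_id
    LEFT JOIN Stickers st ON st.sticker_id = pgs.sticker_id
    WHERE p.player_id = 1
    ORDER BY st.name
""").fetchall()

for row in loadout:
    print(row)
```

Adding ammo and gear would mean more tables and more joins in the same style.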

I’m going to take a little intermission now, and wax poetic about websites for a bit.

Let’s suppose I design this super-cool web page that includes a “real-time” animated clock that ticks in time with the actual…well…time. Pretty awesome, right? I’m sure no one’s thought to do anything like that before. Anyway, the way this thing works is that a user types my URL into their browser address bar, DNS servers track down my web server, which then receives the request and generates a bundle of content in response. It then ships this content back to the user’s browser, which renders the page.

Somehow (magic!), the page requests a reload every second, and so every second, that same process repeats, and voilà, the user has a pretty cool animated clock. Internet speeds are good, my packet size is small (that's what she said!), and web servers are good at serving web content, so it’s a pretty good user experience.

Word gets out, though, and suddenly everyone is logging on to my page, and before you know it, my web server is having to serve up millions of new pages every second.   Before long, my page performance becomes terrible, and my 15 minutes of fame quickly runs out as everyone grumbles about what an idiot I am.

In this scenario, I could have used JavaScript to update the clock client-side instead of server-side. Instead of having (potentially) millions of users all asking me (well, my web server) to generate content every second, that workload can be distributed to each person’s computer.
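A quick back-of-the-envelope sketch of the difference, in Python. The function name and the numbers are made up for illustration; it only counts how many responses my server has to produce, and it assumes each user loads the page exactly once in the client-side case.

```python
def server_responses(users: int, seconds: int, clock_runs_in_browser: bool) -> int:
    """Count the pages the server must generate while users watch the clock."""
    if clock_runs_in_browser:
        # One page load per user; the ticking happens on their machine.
        return users
    # One full page reload per user, every second.
    return users * seconds


# A million viewers watching for one minute.
server_side = server_responses(1_000_000, 60, clock_runs_in_browser=False)
client_side = server_responses(1_000_000, 60, clock_runs_in_browser=True)
print(server_side, client_side)
```

The gap keeps growing the longer people watch, because only the server-side number scales with time.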

…There was a point to that sidebar. One of the big benefits to NoSQL is that it allows the data workload to be distributed in a similar fashion. Data can still be complex and sense needs to be made of it, but if we can encapsulate it well, then we can let a million devices do some of that work instead of forcing our database server to do it all.

To wrap things up, a relational design for our hypothetical shooter game would be good at keeping the data trim and well-controlled, but hard drive space (cloud or otherwise) is cheap now, and the integrity of gun stickers and ammo types would likely need some application logic (at least partially) anyway.
 
In our NoSQL scenario, something like PlayerInfo.json can keep up with all that data (and more) in a nice, nested structure, and the data server doesn’t have to fool with connecting the dots. It’s always stored as a complete package.
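As a minimal sketch of what such a package might look like, here is a made-up player document built and round-tripped with Python's json module. Every field name and value is an assumption for illustration, not a schema from any particular NoSQL product.

```python
import json

# One player's complete "package": guns, skins, stickers, ammo, and gear
# are nested inside the player instead of spread across tables.
player_info = {
    "player": "ron",
    "guns": [
        {
            "name": "shotgun",
            "skin": "camo",
            "stickers": ["smiley face"],
            "ammo": [{"type": "incendiary slug", "count": 12}],
        }
    ],
    "gear": ["helmet"],
}

# Roughly what would be written out as PlayerInfo.json.
serialized = json.dumps(player_info, indent=2)

# Reading it back needs no joins; the whole loadout arrives at once.
restored = json.loads(serialized)
first_gun = restored["guns"][0]
print(first_gun["name"], first_gun["skin"], first_gun["stickers"])
```

The trade-off is that 'camo' and 'smiley face' are now plain strings repeated in every player's document, which is exactly the footprint and integrity cost the relational design was avoiding.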

This particular scenario is also good because there’s not much interaction between the data “packages”. Maybe my player info tracks my kill-count (but hopefully not my deaths), but even that (which tangentially involves other players) doesn’t have to interact with “their” data. The application code can increment my kill (or more likely killed) count without having to maintain strict transactional considerations involving outside data.
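Here is a small sketch of that idea, using a plain Python dict as a stand-in for a document store. The player IDs, field names, and the increment helper are all hypothetical. Each update touches exactly one player's document, so nothing has to be coordinated across documents.

```python
# A dict keyed by player ID stands in for the document store.
store = {
    "me":   {"player": "me",   "kills": 0, "deaths": 0},
    "them": {"player": "them", "kills": 0, "deaths": 0},
}


def increment(documents: dict, player_id: str, field: str) -> None:
    """Bump one counter inside a single player's document."""
    documents[player_id][field] += 1


# They got me (again). Two independent single-document updates,
# with no transaction spanning both players.
increment(store, "them", "kills")
increment(store, "me", "deaths")

print(store["me"], store["them"])
```

If the game ever needed both counters to change together or not at all, that is the point where the relational database's transactions would start to look attractive again.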


Another benefit (though it can sometimes feel otherwise) is a lack of strict schema definitions.   In a more rigid database, properties are well-defined, which is nice, because you always know what you’re going to get, but if the application is prone to changes, then maintaining a strict schema can be difficult.   If we decide to add a bonus gun for everyone’s birthday, then we need to add a Birthday field to the Player record in our database, but what do we do with all those existing folks who clearly don’t have the (non-existent until now) Birthday value filled in?  Our application code would have to handle that situation anyway, so it’s not overly cumbersome to make it do so without a strict schema in place.
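A minimal sketch of how the application might cope, assuming player documents are plain dicts and that a birthday, when present, is stored as an ISO "YYYY-MM-DD" string. Both the field name and the helper function are my own inventions for illustration.

```python
from datetime import date


def birthday_bonus_due(player_doc: dict, today: date) -> bool:
    """Return True if this player should get the birthday bonus gun today."""
    birthday = player_doc.get("birthday")  # older documents simply lack the key
    if birthday is None:
        return False  # no birthday on file, no bonus (and no crash)
    born = date.fromisoformat(birthday)
    return (born.month, born.day) == (today.month, today.day)


# An existing player created before the Birthday field existed...
old_player = {"player": "veteran"}
# ...and a new player who filled it in.
new_player = {"player": "rookie", "birthday": "1990-11-03"}

today = date(2017, 11, 3)
print(birthday_bonus_due(old_player, today), birthday_bonus_due(new_player, today))
```

No migration is needed for the old documents; the check just treats a missing field as "unknown".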

NoSQL isn't a magic bullet (or even an incendiary slug shell with camo skin and a smiley face sticker) for every situation, but it does have its place, and it provides a nice paradigm for distributed systems involving fairly well-isolated data.
