Difficulties of Unstructured Data

Aug 6, 2017 00:00 · 1014 words · 5 minutes read data database machine-learning

A little while ago I was doing a tech interview and got asked to explain the difference between SQL and NoSQL databases. I gave a high level response about structured vs unstructured data and scalability, and I think it was a decent answer. But it got me thinking a bit more about the actual challenges associated with unstructured data.

We can start by thinking of structured data as data that fits into rows in a table. Say you need to store user information: there would be a name, email, sign up date and a few other columns in this table. User data is actually a great use case for relational databases (RDBMSs). The data can be accessed using SQL, which works well because queries against a known schema can be planned and optimized, and the data itself is stored in a layout that can be read back nice and fast.
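To make the "rows in a table" idea concrete, here's a minimal sketch using Python's built-in sqlite3 module. The table and column names are my own illustration, not anything prescribed:

```python
import sqlite3

# Throwaway in-memory database, just for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT, signed_up TEXT)"
)
conn.execute(
    "INSERT INTO users (name, email, signed_up) VALUES (?, ?, ?)",
    ("Ada", "ada@example.com", "2017-08-06"),
)

# Every row has the same known columns, so queries are simple to write
# and easy for the database to plan and optimize.
for row in conn.execute("SELECT name, email FROM users WHERE name = ?", ("Ada",)):
    print(row)  # ('Ada', 'ada@example.com')
```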

Unstructured data, then, is data that isn't structured: it doesn't fit nicely into a table. This means it doesn't work well with an RDBMS. The schemas would likely be quite complex, and there would be many more tables requiring a lot of very time consuming joins. Instead we need something else.

MongoDB is one of the more popular NoSQL databases that specialises in dealing with unstructured data. It is document based rather than table based, so each entry in the database is a BSON (Binary JSON) object. This can make MongoDB a bit more intuitive than, say, MySQL or Postgres, and it cuts down on the time spent devising complex schemas. Most of the data you need is stored in the same document, so there's no need for those slow joins to happen.
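As a rough sketch of what that looks like in practice, here's the same sort of user record kept as a single document, using the pymongo driver and assuming a MongoDB server running on the default localhost:27017. The database, collection and field names are illustrative:

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
users = client.myapp.users  # databases and collections are created lazily

# The whole user, nested data included, lives in one document, so there
# is no join needed to pull back their orders.
users.insert_one({
    "name": "Ada",
    "email": "ada@example.com",
    "orders": [
        {"item": "keyboard", "price": 120},
        {"item": "mouse", "price": 40},
    ],
})

doc = users.find_one({"name": "Ada"})
print(doc["orders"][0]["item"])  # keyboard
```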

So MongoDB, and NoSQL in general, sound pretty good. And they do have some real, practical benefits over relational DBs, especially when dealing with unstructured data. For many applications where data isn't structured, using a document oriented DB makes perfect sense. So what makes unstructured data so hard to deal with?

Why is it Hard?

There is so much of it

There is a lot of unstructured data. Most decent sized companies that have been around for a few years have more data than a human could consume in ten lifetimes. So to use this data in any meaningful way, we need to feed it through some computer process that can crunch the numbers in a few seconds, rather than a few generations.

Yet, we also have a lot of structured data. So this alone isn’t a good enough explanation of why unstructured data is hard.

Computers like structure

What makes this large amount of data difficult to work with is that computers like structure. The protocols and processes that govern how our computers use and store data are all very predictable and structured, so defining a structured algorithm to deal with unstructured data is a difficult task.
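As a tiny illustration of that (my own example): parsing a structured record is a one liner, while pulling the same facts out of free text already requires guesswork, and breaks the moment the wording changes:

```python
import re

# Structured: fields are in known positions, parsing is trivial.
name, email, signed_up = "Ada,ada@example.com,2017-08-06".split(",")

# Unstructured: the same facts buried in prose need pattern matching,
# and a regex like this one is brittle.
text = "Ada (reach her at ada@example.com) joined on 2017-08-06."
match = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
print(match.group())  # ada@example.com
```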

Machine learning is hard

Enter Machine Learning (ML). There has been a lot of hype around ML lately, but setting that aside, ML is genuinely good at dealing with unstructured data. With ML we can process natural language using NLP, and determine whether a YouTube video is a funny cat video.
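To give a flavour of what that looks like, here's a toy supervised text classifier built with scikit-learn. The video titles and labels are made up purely for illustration, and a real classifier would need far more (and far messier) data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

titles = [
    "funny cat knocks cup off table",
    "cat fails jump compilation",
    "quarterly earnings call recording",
    "how to file your tax return",
]
labels = [1, 1, 0, 0]  # 1 = cat video, 0 = not

# Turn free text into numeric features, then fit a simple classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(titles, labels)

print(model.predict(["cute cat chases laser pointer"]))  # likely [1]
```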

But doing this sort of thing isn't easy. To do it, and do it properly, requires a lot of people with specialized knowledge and a good chunk of time, which also makes it quite expensive.

Why Should I Care?

The total amount of data is growing exponentially. The rise of the internet has allowed for innovations like social networks and, more recently, the Internet of Things (IoT), and with them a flood of data. From IoT alone, around 21 billion connected devices are expected to be deployed by 2020. These devices will all be collecting analytics and generating logs, and all this new data will have to be stored somewhere to be processed.

It is estimated that 90% of all data in existence today was generated in the last five years, and that the total is doubling every 12 months. By 2020 it is expected to reach 44 zettabytes. With such a huge proportion of it unstructured (a commonly accepted rough figure is 80%), chances are you will have to interact with, and even process, this data yourself, whether you're an engineer, data scientist or manager.
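Taking those figures at face value, a quick back-of-the-envelope calculation shows how quickly that doubling compounds (the year range is my own choice for illustration):

```python
# If the total doubles every 12 months and hits 44 ZB in 2020, work
# backwards and forwards from that anchor point.
total_2020_zb = 44.0

for year in range(2017, 2023):
    zb = total_2020_zb * 2 ** (year - 2020)
    print(f"{year}: ~{zb:g} ZB")
# 2017: ~5.5 ZB ... 2020: ~44 ZB ... 2022: ~176 ZB
```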

This data can also be very useful. IDC projects that organizations that analyse ‘all relevant data’ and deliver ‘actionable information’ will achieve a combined $430 billion in productivity gains over their less data-driven peers by 2020.

What Can I Do?

The first step is to understand the challenges that unstructured data poses, and why it is important to engineer solutions to these problems. There are more application specific challenges with big data, but this post should serve as a good starting point for understanding the issues.

It is also important to learn at least a little bit of statistics and machine learning. The algorithms and processes under the ML umbrella are going to become ever more important over the coming years.

Finally, it’s important to get some experience working with large scale databases for both types of data. Most projects these days, in all domains, will have some sort of data storage or analysis part which must interact with a database. This is especially common when working on web projects. It is important to cut your teeth before working with these systems on a critical, large scale project to prevent really bad mistakes.

Wrapping Up

It’s pretty clear that dealing with unstructured data poses some unique and interested challenges. And having the skills to find solutions will only become more important with time.

Sources