advanced analyticsBIBusiness Intelligence

A few years ago, I was working as an external advisor on a final year project of a grad student. He wanted to develop a unified search engine for online deals available on e-stores regionally; somewhat like Google Products, but localized for different regions.

The idea was for a user to come to his search page and enter his queries like Nikon D3300 or Full Sleeve Hoodie for Men, etc. The search engine would be able to tokenize the query, identify the product, category, and other product properties, then use these properties to find appropriate deals from all the supported online stores.

It would have been too slow to query all the e-stores at runtime. Therefore, it was decided to develop and run several background jobs (separate crawler services for separate e-store) designed to fetch the data for products and deals pre-emptively and fill in a local Database. An end user query would be served from this local database. Utilization of this technique would also effectively make the integration of this search engine with a BI product seamless. The local database can be used as source and desired reports & visualization can be easily generated.

The problem? There was an endless variety of products being sold online like medicines, books, electronics, gadgets, crockery, furniture, apparel and even groceries. Every product has a different set of properties like size, color, speed, expiration date, and area, etc. This raised several design related questions like how to handle such versatile data in terms of data types and relations (unlimited types of product attributes like size, color, fabric for clothing, but memory size, speed for electronic gadgets, etc.), hard to find relations between similar products from different e-stores (leading to storing several copies of same product since it was hard to correlate), the data size was growing tremendously (getting all deals for all the products from all stores), parallel writes from several backend jobs could raise concurrency issues, and so on.

Each and every application ranging from medium-sized to enterprise level being developed in this era are designed to work seamlessly with any available modern BI product or Data Visualization tool. Data required in multiple formats is the demand put forth by business users, managers and executives. Huge data coming at rocketing speeds without being properly managed is a cumbersome situation for any BI application to effectively entertain the requests of its users. Selection of the type of database to employ is a very relative and valid question at this point.  At that time, I was trained to work with relational databases only and some of the above-mentioned cases were challenging for me. It was hard to conceptualize a good ER design for the current problem. An obvious solution was to have a table “Products” with some simple fields like Product ID, Category and Product name, and another table to hold key-value pairs to hold variable number of attributes for products, but I was not comfortable with this because I always like to enforce integrity of data using strict field types, constraints, and relationships at the backend.

The idea that was pitched, leveraged the use of a NoSQL database. My first reaction (with limited knowledge of NoSQL databases) was that it would be slow since it did not have strict data types and could not index data. How would the data integrity at the back end be managed? Would it work for massive datasets? To explore, we decided to do some research on it.

After conducting some research, we realized that NoSQL is specifically designed to work with non-structured/semi-structured data, works better in high concurrency scenarios (parallel reads and writes), and is better in the case of massive data and so on.

If we scrutinize this problem from a BI analytics perspective, it raises many queries in one’s mind related to data discovery, visualizations, dashboards and self-service analytics. Huge amounts of data that is not well organized in a structured format with proper relationships and integrity, generating reports on it of any sort become extremely difficult and the accuracy of the reports either sales, product comparisons or financial nature becomes highly questionable. Taking an informed decision by the management seems difficult. But the trend of technology is shifting towards the run-time management of semi-structured data by offering data wrangling and integration capabilities in various BI products. With the use of extensive self-service analytics, business users can easily massage the data and then generate reports, charts, gauges and other visualizations using the Data Visualization tools.

When should Non-Relational Databases be used?

Relational databases are based on relational algebra, uses Structured Query Language (SQL) for representing and storing data in tables and rows; are good in handling lots of complicated querying, database transactions and routine analysis of data, which must be processed reliably utilizing a certain set of properties.

Non-Relational databases, however, represents data in collections of a lightweight data interchange format called JavaScript Object Notation (JSON) documents, CSV and TSV and are best for de-normalizing the very complex database schemas that allow developers to store serialized arrays in JSON objects and in the same collections that have different fields or attributes.

The limitation of using a non-relational database comes forth when the developers find themselves unable to create joins, like in relational databases; therefore, they need to join the data manually within the code. Worse, there is no automatic treatment of transactional operations; therefore, the developers must manually choose to create a transaction, manually verify and commit it and roll it back. The non-relational database model is still a game-changer as developers can store data without explicit and structured mechanisms to link data from different tables (or buckets) to one another.

Putting it simply, a relational database should be used when the data being managed is of the same nature and creating relationships amongst with minimum development is required. However, a non-relational database can be utilized when the data is coming in from multiple sources, at numerous speeds and volume.  Each type of database technology has its pros and cons, but at the end, the more robust and accurate choice can only be made based on experience and knowledge of the same.

Leave a Reply

Your email address will not be published. Required fields are marked *

Post comment