(In this SmartBits, Arun Krishnan outlines “Challenges in testing big-data applications”. The video is at the end of this blog.)
“I understand that big data is characterized by three V’s: volume, velocity, and variety, with the data formats classified into the categories of structured, semi-structured, and unstructured data, acquired from a variety of sources. What are some of the challenges or issues that these pose to validation?” Dr Krishnan’s answer to this question is below.
The three V’s of volume, velocity, and variety actually depend on whom you ask; there are 4V and 5V formulations as well. Most definitions of Big Data revolve around volume, variety, and velocity, but in a true sense all of these are relative.
There are people who say that 1 TB of data and above is Big Data. The best definition as of today is: one bit more data than your system can handle. If a system has 8 GB of memory and there are 8 GB and 1 bit of data that cannot be loaded into memory, that is Big Data, and the data then needs to be broken into chunks.
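As a minimal sketch of that chunking idea, assuming a hypothetical file name ("sales.csv") and column name ("amount"), here is how data too large for memory might be processed in fixed-size chunks in Python:

```python
import pandas as pd

# Hypothetical example: compute an aggregate over a file too large to
# fit in memory by streaming it in chunks instead of loading it at once.
CHUNK_ROWS = 1_000_000  # rows per chunk; tune to available memory

total = 0.0
count = 0
# "sales.csv" and the "amount" column are assumed names for illustration.
for chunk in pd.read_csv("sales.csv", chunksize=CHUNK_ROWS):
    total += chunk["amount"].sum()
    count += len(chunk)

print("mean amount:", total / count)
```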
When we talk about Big Data, we need to understand that it is relative. What Big Data means to a retail chain, with point-of-sale data coming in every second or every minute, need not be the same for a company that tests software, where the focus is on test results coming in every few minutes or every hour.
From a retail chain’s perspective, HR data is not Big Data, but from an HR perspective it is: they have a variety of data sets coming in and they have to pull it all together. The trick is in bringing all the data together and then digging deeper into it. It is not about the quantity of data.
Data comes in different forms: structured, semi-structured, and unstructured. How we tie it all together and how we gain insights from it is analytics. As an example, one of my students did an internship at an Indian public sector unit, where he was asked which colleges were the best to hire from. There is a huge amount of data one could gather for this, but the student did something really simple and straightforward. He took the average performance score for every college, and the average amount of time people from that college had spent in the organization. He plotted the data with performance on the Y-axis and time spent on the X-axis, then arbitrarily drew two cut lines, one parallel to the X-axis and one parallel to the Y-axis. Suddenly there were four quadrants, and interestingly enough, all the IITs came in the bottom quadrant: low retention and low performance. Very simple things can be done in analytics; the idea here is tying those two pieces of data together.
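As a hedged sketch of that quadrant analysis, assuming a hypothetical table of employees with college, performance, and tenure columns (all names invented for illustration):

```python
import pandas as pd

# Hypothetical data: one row per employee, with the college they came
# from, a performance score, and their tenure in years.
df = pd.DataFrame({
    "college":     ["A", "A", "B", "B", "C", "C"],
    "performance": [3.1, 3.5, 4.2, 4.0, 2.8, 2.6],
    "tenure_yrs":  [1.0, 1.5, 4.0, 3.5, 0.8, 1.1],
})

# Average both measures per college, as the student did.
by_college = df.groupby("college")[["performance", "tenure_yrs"]].mean()

# Two arbitrary cut lines (one parallel to each axis) split the plane
# into four quadrants; here the overall medians serve as the cuts.
perf_cut = by_college["performance"].median()
tenure_cut = by_college["tenure_yrs"].median()

by_college["quadrant"] = (
    (by_college["performance"] >= perf_cut).map({True: "high-perf", False: "low-perf"})
    + " / "
    + (by_college["tenure_yrs"] >= tenure_cut).map({True: "high-retention", False: "low-retention"})
)
print(by_college)
```

Colleges landing in the "low-perf / low-retention" quadrant correspond to the bottom quadrant in the student’s plot.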
Even for testing, it is important to figure out how we can tie together real-time data coming in from devices, what we are getting from server logs, and what developers might be putting in as comments, and then use all of that to infer what the issues might be and to build test cases.
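As an illustrative sketch only, with record formats and field names that are pure assumptions, correlating device events, server log lines, and developer comments by timestamp might look like this:

```python
from datetime import datetime, timedelta

# Hypothetical records from three sources; all fields are assumed.
device_events = [{"ts": datetime(2023, 1, 1, 10, 0, 5), "event": "sensor timeout"}]
server_logs   = [{"ts": datetime(2023, 1, 1, 10, 0, 7), "line": "ERROR retry exhausted"}]
dev_comments  = [{"ts": datetime(2023, 1, 1, 10, 0, 0), "note": "FIXME: flaky retry logic"}]

def correlate(events, logs, notes, window=timedelta(seconds=30)):
    """Group records from all three sources that fall within one time window."""
    combined = (
        [("device", r["ts"], r["event"]) for r in events]
        + [("server", r["ts"], r["line"]) for r in logs]
        + [("comment", r["ts"], r["note"]) for r in notes]
    )
    combined.sort(key=lambda r: r[1])
    clusters, current = [], [combined[0]]
    for rec in combined[1:]:
        if rec[1] - current[-1][1] <= window:
            current.append(rec)
        else:
            clusters.append(current)
            current = [rec]
    clusters.append(current)
    return clusters

for cluster in correlate(device_events, server_logs, dev_comments):
    print(cluster)  # each cluster is a candidate issue to turn into a test case
```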