1.0 Introduction
An explosion of data worldwide has characterized the last few decades; it is estimated that the world created 1.8 zettabytes of data in 2011 alone. Most companies have realized that successful business in the twenty-first century depends on the collection, analysis, and use of big data. Successful analysis and application of big data has already become a main factor underpinning innovation, the growth of the consumer base, and productivity. Big data is a well-known term that refers to the “exponential growth and availability of data, both structured and unstructured” (Institute for Health Technology Transformation 3). It is the increase in both structured and unstructured data that an organization is exposed to every day; effective management of such data may present the firm with new business ideas and knowledge, and eventually lucrative business opportunities. It is therefore imperative to understand that the explosion and increased use of new data necessitates new algorithms and techniques for its analysis and use, as well as new execution environments, if organizations are to benefit.
2.0 Background
The Congress of the United States of America describes “big data as a term that describes large volumes of high velocity, complex, and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management, and analysis of the information” (Institute for Health Technology Transformation 4). The term applies to scenarios in data processing and use where data grows to petabytes and beyond, making it challenging to operate within traditional processing environments. The 3Vs (volume, variety, and velocity) are commonly used to define big data.
2.1 Concepts in Big Data
There are several important concepts used in defining big data, as discussed below.
2.1.1 Volume
To describe big data in terms of volume, it is important to appreciate that many variables in organizations serve to increase the amount of data a firm is presented with on a daily basis. Since advancements in communication and information technology allow commercial organizations to store data and information at relatively low cost, many firms have large volumes of data at their disposal (Abdrabo 3). The sources of data in modern firms include business data stored over years of operation, accumulated sensor data, and unstructured data from social media. These large volumes of data require critical analysis to filter the relevant from the irrelevant.
2.1.2 Variety
Variety refers to the tendency of data to appear in different forms. The various formats of big data include numerical and structured data, data from financial deals, e-mails, video, and stock ticker data (Abdrabo 3). Business firms are thus tasked with devising strategies that are effective in managing these different types of data.
2.1.3 Velocity
Velocity describes the tendency of big data to stream into commercial organizations at different rates. Because data streams into the twenty-first-century commercial organization at exceptional speed, it is crucial that managers and administrators ensure the data is handled in a timely manner (Abdrabo 3). Notably, numerous modern business firms are confronted with the challenge of responding effectively to the velocity of big data.
2.1.4 Variability
Variability illustrates that the data flowing into business firms may at times be very unpredictable and dynamic, depending on the business peaks that exist in different industries. Data streams from daily, cyclic, and unanticipated events must be well managed if they are to be used profitably by the firm (Abdrabo 3).
2.1.5 Complexity
The complexity of big data emanates from the fact that such data streams into an organization from a variety of sources (Abdrabo 3). It is therefore imperative for commercial firms to refine the data and create associations between data collected from diverse sources.
2.2 Benefits of Big Data to Business
Big data is one of the most important phenomena for business firms in the twenty-first century. It should be noted, however, that the sheer volume of data streaming into a firm on any given day is not what makes big data valuable. On the contrary, the benefits of big data depend on what firms do with the data they collect and how effectively they analyze it (Yin et al. 2). Big data is useful to business firms when it is manipulated and examined to achieve efficiency on important business issues such as time, costs, decision-making, and new product development. The manipulation of big data presents numerous benefits for firms across different industries.
Effective analysis of big data enables commercial organizations to immediately identify the roots of failures, problems, and defects in business operations, consequently enabling the firms to save money and time. Utilization of big data is a major foundation for competition and development in business firms (Manyika et al. 3), since established firms and new entrants alike are utilizing information-driven techniques to innovate and create value from data. Firms with large big data assets and the capacity to exploit them effectively thus gain a competitive advantage over rivals in similar markets.
Organizations that critically examine the big data streaming into the firm can readily determine the most important segment of the market, or the consumers who add the most value to the firm. In industries such as banking, big data may be subjected to data mining and clickstream examination to discover any form of deceptive behavior (Manyika et al. 3). Commercial organizations that engage in thorough analysis of big data can recalculate their entire risk portfolios in real time.
The diverse benefits of big data are grouped together in the figure below (Forbes Insights and Rocket Fuel 1).
3.0 A Review of Big Data Algorithms and Techniques
3.1 Growth in Data
There is no doubt that the collection and use of data in organizations is increasing, and the trend continues upward. As indicated in Figure 1 below, data growth was expected to rise from at least 10-20% annually to as much as 70% year over year by 2015.
Figure 1
An explosion of data greatly impacts every industry in the modern commercial realm. As a result, data in business firms is as important as other factors of production, such as capital and labor (Yin et al. 2). Organizations in every sector are currently caught up in the middle of a big data revolution. Over roughly the last ten years, the business world has undergone many significant transformations through the digitization of functions, and these transformations have increased efficiency in all industries. The use of digital technology has led to huge volumes of data being stored in databases all over the world.
Different actors in the business sector are currently making use of big data to gain insightful knowledge regarding the delivery of quality services and products to clients.
Several factors underpin the increased demand for big data applications, the most important of which are economic (Yin et al. 2). One area where data has increased markedly is the health care industry: to reduce expenses in the American healthcare sector and dissuade overutilization, actors and stakeholders in the industry have to engage more in amassing and transmitting information.
Another main factor in the proliferation of big data applications is the transformation of professional functions (Yin et al. 3). While experts have conventionally relied on expert judgment to make decisions, twenty-first-century professionals have inclined more towards evidence-based decision-making. Such decision-making requires professionals to methodically assess data and base decisions on that assessment, which in turn requires the collection and effective analysis of the available data.
Although professionals in the world of business are traditionally competent in capturing value, the algorithms and procedures traditionally used for this are not effective in the optimal utilization of big data. Instead of being oriented towards the enhancement of outcomes, these tools are geared towards other objectives such as the reduction of costs (Yin et al. 2). To enjoy the benefits of big data fully, firms must therefore adopt an effective methodology for capturing value through effective data analysis.
3.2 Mining and Analysis of Big Data
The analysis of big data is currently the focus of data analysis in modern organizations, because vast amounts of data are generated from different sources, making it challenging to extract, transform, load, and store. It was recently estimated that every minute Google receives four million search queries, e-mail users send more than 200 million messages, 72 hours of video are uploaded to YouTube, two million pieces of content are posted on Facebook, and Twitter users generate 277,000 tweets. Such data is vast and can be overwhelming to analyze (Hashmi and Ahmad 1).
Capturing big data, and determining how long it should be stored, shared, analyzed, and visualized, are among the greatest challenges IT professionals in organizations face. Media organizations, both social and conventional, face the challenge of effectively collecting and analyzing vast amounts of data (Bolin and Schwarz 5). As a result, there is a need for novel architectures, algorithms, and techniques for the analysis and management of big data, as well as for new technical skills, since the conventional ones are not effective in the new environment. Research into new and effective algorithms and techniques has been ongoing and has provided good insights into the topic.
Given the existence of vast amounts of data holding hidden knowledge, the topic of data mining becomes very interesting. Within machine learning, data mining is critical in ensuring that organizations have the useful information they require to be effective. Organizations can only work with information and knowledge that is easy to understand and makes sense to them, which is a major challenge in an environment where professionals must deal with huge data (Diaconita 981). Data mining plays an important role in discovering hidden knowledge in large datasets, and that knowledge is critical for effective organizational decision-making. Effective data mining algorithms and techniques can generate millions of rules and patterns, including useful knowledge (Bolin and Schwarz 5). The discovery process can be guided by subjective or objective measures of the interestingness of patterns.
In the media, the collection, processing, and use of big data are increasing. It is thus possible to achieve continuity between past statistical inference and modern predictive statistics, but only with the successful use of algorithms. Quantitative audience prediction is critical for success in media environments. Relationships between the industry and its audiences are among the challenges that can be addressed using data and information processed with modern technologies and algorithms (Diaconita 983). However, the current systems involved in generating and processing the data are highly complex, and the nature of the mining process today has shifted towards highlighting abstract bundles of behavioral patterns (sociograms and correlations).
3.3 Defining Algorithm
Algorithms are the most effective way of achieving better outcomes from data processing in all environments, including governance. Because of various activities in governance, such as surveillance, the players have to deal with massive amounts of data, which cannot be effectively computed without suitable algorithms (Janssen and Kuk 372). Complex algorithms are being used to process information successfully, which is then used in decision-making. Within the government realm there are various applications of algorithms, including surveillance, searching, fraud detection, decision-making, traffic management, and smart cities. Algorithms play a critical part in collecting data and processing it into the information necessary for decision-making. In general terms, an algorithm comprises the step-by-step process and rules for producing output from input in data processing.
The idea behind the concept of an algorithm is the kind of intelligence used to manipulate data into the necessary information. Simply depicted, algorithms appear as flowcharts or other representations showing the input, the processing, and the eventual outcome. The use of algorithms is critical to the computational strategies necessary for designing various aspects of modern society. Technocratic governance is one outcome of the use of algorithms in modern society; it is built on the notion that “all aspects of a city can be measured and monitored and treated as technical problems which can be addressed through technical solutions” (Janssen and Kuk 372). The Pentagon’s sociocultural modeling and forecasting programs have continued to increase at a very fast rate (González 13). As with the development of software for generating and processing data, further development is needed to address the increase in data these programs generate, especially in government agencies. Complex problems within society can be understood using algorithms, just as it is possible to compute solutions from huge amounts of data.
3.4 Big Data Algorithms
Cloud computing is one area where organizations have to deal with huge amounts of data, and processing is possible within distributed environments, which calls for suitable algorithms (Ranjan et al. 263). Cloud computing has added complexity to data processing because of the environment within which the data is generated and processed. Alongside the need for effective processing and storage of big data, data mining is equally necessary for the growth of organizations (Berra 14). The extraction of concealed patterns and trends has the potential to produce the business intelligence needed for effective decision-making. Organizations spend substantial resources collecting and analyzing data for success in decision-making, but this pays off only where they have the most effective algorithms and techniques. For huge datasets requiring processing, data mining becomes highly challenging.
While there are considerable benefits from the collection and use of big data, serious challenges are involved, and these must be urgently addressed if business firms are to exploit the benefits and opportunities big data presents (Hashmi and Ahmad 2). Security is a chief challenge in the collection, analysis, and utilization of big data; the data obtained by business firms must be safeguarded from all forms of intrusion and unauthorized manipulation, which is possible only with effective mechanisms.
There are various data mining and machine learning algorithms that ensure effective processing and use of big data. The established objectives determine which algorithm is most effective in a given situation; for particular business problems, there are particular algorithms (Hashmi and Ahmad 3). New algorithms are also being designed to augment existing ones or to introduce new ways of processing. Thus, depending on the context of use, some algorithms are better than others.
3.4.1 Conventional algorithms
Data mining algorithms that have been used traditionally, such as k-means clustering and the Naïve Bayes classifier, can still be useful, but they are not effective in the current big data environment. The reason is that these algorithms are only effective on static data that fits in memory (Hashmi and Ahmad 3). This is no longer the case in an environment where data is constantly on the move, mostly across networks. The conventional algorithms therefore need modification to handle data in transit and memory limitations.
The classification boundary for data streams is always changing as more data arrive. Situations abound where only recent examples are needed to model the concept drift, rather than whole streams of data. In such situations, the solution can be a weighting policy or the sliding-window concept (Hashmi and Ahmad 3). In the latter, only a certain amount of recent data is considered, while older data is discarded. In the former, weights are assigned to data elements such that newer data receives higher weights and older data lower weights, as the sketch below illustrates.
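As an illustration, the following Python sketch implements both strategies over a numeric stream; the window size and decay factor are illustrative assumptions rather than values from the source.

```python
# A minimal sketch of the two stream-adaptation strategies described above:
# a sliding window that keeps only the most recent items, and an exponential
# weighting policy that down-weights older items. Parameters are illustrative.
from collections import deque

class SlidingWindowMean:
    """Keep only the last `size` observations; older data is discarded."""
    def __init__(self, size: int):
        self.window = deque(maxlen=size)  # deque drops the oldest item itself

    def update(self, x: float) -> float:
        self.window.append(x)
        return sum(self.window) / len(self.window)

class WeightedMean:
    """Exponentially weight observations so newer data counts more."""
    def __init__(self, decay: float = 0.7):
        self.decay = decay
        self.estimate = None

    def update(self, x: float) -> float:
        if self.estimate is None:
            self.estimate = x
        else:
            # Older history is discounted by `decay`; the new item gets 1-decay.
            self.estimate = self.decay * self.estimate + (1 - self.decay) * x
        return self.estimate

stream = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]  # concept drift after the third item
window, weighted = SlidingWindowMean(size=3), WeightedMean(decay=0.7)
for x in stream:
    print(round(window.update(x), 2), round(weighted.update(x), 2))
```

Both estimators track the post-drift level far faster than a mean over the whole stream would, which is the point of discarding or down-weighting older data.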
3.4.2 Weka and R
Advanced tools are being used in the mining of big data, some of which have had to be modified because their conventional versions cannot be used as they are. Such tools include Weka and R, which must be modified for application to the mining and analysis of big data. One such modification is the Massive Online Analysis (MOA) framework, a Weka-like tool providing state-of-the-art data mining algorithms for huge amounts of data; it also supports incremental learning on data streams. “Rmpi” is a modification of R that can handle huge data sets or “streams”: it provides a Message-Passing Interface (MPI) in the R language and offers better stream-mining potential than traditional R (Diaconita 984).
3.4.3 K-Anonymity
Data mining is critical for organizations as dependence on data for operations increases, and mining big data is as important as data mining has been over the years. However, added complexity emerges because big data is vast and requires a more complex environment in which to operate. Unlike ordinary data, processing big data requires a distributed environment, given characteristics such as volume, velocity, and variety: such data require parallel processing power in distributed programming frameworks such as Hadoop, among other approaches. The file system related to such processing is the Hadoop Distributed File System (HDFS), which allows for processing in the distributed environment (Radhika and Kumari 2).
Hadoop has been utilized in such environments (Hashmi and Ahmad 2). K-anonymity is an effective algorithm for use within the MapReduce programming paradigm, and the anonymization algorithm has been shown in the study by Hashmi and Ahmad to work in the mining of big data. With increasing reliance on cloud computing, organizations have to deal with vast data without necessarily controlling the underlying infrastructure. Just as computing services are provided as a service over the network, so can this algorithm be, helping to protect large amounts of data. MapReduce is the programming paradigm applicable to processing data within distributed environments, and it has the potential to exploit the parallel processing power of Graphical Processing Units (GPUs). A minimal check of the k-anonymity property is sketched below.
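The source does not give an implementation, but a minimal sketch of testing the k-anonymity property on a small table, using pandas and hypothetical column names, might look as follows.

```python
# A minimal sketch of checking k-anonymity on a tabular dataset with pandas.
# The column names, data values, and choice of k are illustrative assumptions.
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    """A table is k-anonymous w.r.t. the quasi-identifiers if every
    combination of their values appears in at least k records."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

records = pd.DataFrame({
    "age_band":   ["20-29", "20-29", "30-39", "30-39", "20-29", "30-39"],
    "zip_prefix": ["021**", "021**", "100**", "100**", "021**", "100**"],
    "diagnosis":  ["flu", "cold", "flu", "asthma", "flu", "cold"],
})
# Each (age_band, zip_prefix) combination covers 3 rows, so k=3 holds.
print(is_k_anonymous(records, ["age_band", "zip_prefix"], k=3))  # True
```

An anonymization algorithm generalizes the quasi-identifiers (here, pre-banded ages and truncated zip codes) until a check like this passes; in a Hadoop setting the group-size counting is what MapReduce would distribute.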
3.4.4 Cluster algorithm
Security and the protection of anonymity are among the main concerns in the use of big data, and the protection of anonymity is a central focus of big data processing technology. While still being improved, the methods for protecting data and privacy will go a long way towards ensuring the successful use of big data for the benefit of organizations. A proposed solution to the problem of attackers obtaining data and using it maliciously is the K-member clustering algorithm. The k-anonymity and l-diversity algorithms have been proven effective in studies for ensuring privacy in the diversity and use of big data (Yin et al. 4). These anonymity models can be clustered and viewed as the same problem for the purpose of ensuring anonymity, and thus the security, of big data. Processing data using the cluster model ensures a high level of security regardless of the volume of data the organization has to process and use. The model has major advantages, including reducing the algorithm’s execution time, reducing the chance of information loss during processing, and making the generalization process efficient, which is particularly important in big data processing.
Cluster algorithms are effective in establishing relationships within an organization’s dataset. Such an algorithm can be used to find classifications within the customer base, or to decide which services and customers to place in the same group (Yin et al. 4). Compared to other learning approaches, cluster algorithms offer many benefits, among them the way new applications can be built on the connections they create between different groups of data within the organization, as the sketch below illustrates.
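As an illustration of the customer-segmentation use case, the following sketch applies k-means clustering to hypothetical customer features; the feature set, data values, and cluster count are assumptions for the example, not details from the cited study.

```python
# A minimal sketch of using clustering to segment customers, as described above.
# Features and cluster count are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [annual_spend, visits_per_month]
customers = np.array([
    [120.0, 1], [150.0, 2], [130.0, 1],   # low-spend, infrequent
    [900.0, 8], [950.0, 9], [880.0, 7],   # high-spend, frequent
    [400.0, 4], [420.0, 5], [390.0, 4],   # mid-range
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)  # cluster id assigned to each customer
print(labels)                           # e.g. three groups of three customers
print(kmeans.cluster_centers_)          # centroid (typical profile) per segment
```

Each centroid summarizes a customer segment, which is the kind of grouping the text describes for deciding which services and customers belong together.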
3.5 Managing Big Data
Two traditional architectures can still be used to manage big data effectively: the Hadoop ecosystem and the HPCC system.
3.5.1 Hadoop Ecosystem
Doug Cutting designed Hadoop around two main services: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. The former is a dependable file system for a distributed environment, while Hadoop MapReduce is a high-performance parallel data processing engine. Combined, the two have proven effective in processing large datasets on huge hardware clusters, and the technique works well in settings where a server collects vast amounts of data from diverse sources. Using Hadoop reduces the time and effort needed to load data into other systems. MapReduce is the most widely used component and has many applications depending on the environment where analysis is needed (Abdrabo et al. 3). The technique has proven versatile and adaptable for the analysis of huge amounts of data, and it supports general-purpose reporting on the analyzed data.
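To make the programming model concrete, here is a minimal single-machine sketch of the MapReduce flow (map, shuffle, reduce) in Python; Hadoop would distribute each phase across a cluster, which this local word-count simulation only imitates.

```python
# A minimal sketch of the MapReduce programming model in plain Python,
# simulating locally what Hadoop MapReduce distributes across a cluster.
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key (here, sum the counts)."""
    return key, sum(values)

splits = ["big data needs big tools", "big data is growing"]
pairs = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'is': 1, 'growing': 1}
```

Because the map and reduce functions are independent per split and per key, the framework can run them in parallel on many machines, which is where the speedup on big data comes from.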
3.5.2 HPCC Systems
Two clusters make up the HPCC Systems architecture: Thor and Roxie. The architecture also includes commonly used middleware components that handle data communication with external layers, while client interfaces provide end-user access and tools for system administration. Auxiliary components handle monitoring and the loading of information into the file system from external sources. Thor is responsible for transforming, linking, and indexing big data; its capacity to handle huge volumes of data makes it suitable for big data analysis. The query cluster, Roxie, delivers high-performance query and data warehouse capabilities (Abdrabo et al. 3). Enterprise Control Language (ECL) is a language suited to dealing with huge data, allowing easier calculation and computation over high-volume data streams.
3.6 Big data techniques
In big data, special techniques are used to process the huge amounts of data, because such data is difficult to handle with conventional database management tools. Several techniques are useful here.
3.6.1 Sampling technique
To deal with large data streams, the most effective technique is sampling. Samples of data from the database are used instead of the whole dataset, in such a way that the desired statistical measures are approximated. Sampling makes it possible to analyze data streams effectively without having to deal with the whole dataset, which can be overwhelming. Commonly used methods include Hoeffding bounds and reservoir sampling (Hashmi and Ahmad 4). A sketch of the latter follows.
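The following is a minimal Python sketch of reservoir sampling (commonly known as Algorithm R); the stream and sample size are illustrative.

```python
# A minimal sketch of reservoir sampling, one of the sampling methods named
# above: it keeps a uniform random sample of k items from a stream of
# unknown length using only O(k) memory.
import random

def reservoir_sample(stream, k: int):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, i)    # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item     # replace a random existing sample
    return reservoir

# Usage: sample 5 values from a stream too large to hold in memory.
print(reservoir_sample(range(1_000_000), k=5))
```

The key property is that every item seen so far has the same probability of being in the reservoir, so statistics computed on the sample approximate those of the full stream.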
3.6.2 Incremental learning
Incremental learning is a method applied in mining huge amounts of high-velocity data. In generating the model, incremental learning processes one instance at a time, normally read from the input file or stream (Hashmi and Ahmad 4). Common examples of incremental algorithms are Naive Bayes and multi-layer neural networks trained with stochastic backpropagation; a minimal sketch follows.
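As an illustration, the following sketch uses scikit-learn's partial_fit interface to train a Naive Bayes model one mini-batch at a time; the synthetic data, batch size, and feature count are assumptions for the example.

```python
# A minimal sketch of incremental learning with scikit-learn's partial_fit API:
# a Naive Bayes classifier is updated one mini-batch at a time instead of
# loading the whole dataset into memory. The data here is synthetic.
import numpy as np
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
classes = np.array([0, 1])  # all classes must be declared up front for streaming

def batches():
    """Stand-in for a data stream: yields (features, labels) mini-batches."""
    rng = np.random.default_rng(0)
    for _ in range(100):
        X = rng.normal(size=(32, 4))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y

for X, y in batches():
    model.partial_fit(X, y, classes=classes)  # update the model, discard the batch

print(model.predict(np.array([[2.0, 2.0, 0.0, 0.0]])))  # likely class 1
```

Each batch is discarded after the update, so memory use stays constant no matter how long the stream runs, which is exactly what the high-velocity setting requires.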
3.6.3 Parallelization
A challenge in using incremental learning for data mining and analysis in big data is speed, as the process is normally slow. Parallelization solves this problem: the data is split and processed on different processors, and the outcomes are combined by averaging or voting to produce the eventual result. Sub-trees of the complete tree can be built within each processor to make the process easier and faster. Commonly used parallel algorithms include bagging and stacking, although these are not the only options, since other common algorithms can also be modified incrementally to mine and analyze big data effectively. A bagging-style sketch follows.
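The following sketch illustrates the split-train-vote idea with Python's multiprocessing module and decision trees; the data, chunk count, and choice of base learner are illustrative assumptions, not the specific method of the cited study.

```python
# A minimal sketch of the parallelization idea described above: split the data,
# train one model per chunk in separate processes, then combine predictions by
# majority voting (a bagging-style ensemble).
from collections import Counter
from multiprocessing import Pool
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_on_chunk(chunk):
    """Train an independent model on one split of the data."""
    X, y = chunk
    return DecisionTreeClassifier(random_state=0).fit(X, y)

def majority_vote(models, x):
    """Combine the per-chunk models' predictions by voting."""
    votes = [int(m.predict(x.reshape(1, -1))[0]) for m in models]
    return Counter(votes).most_common(1)[0][0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4000, 5))
    y = (X[:, 0] > 0).astype(int)          # synthetic labels
    chunks = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
    with Pool(processes=4) as pool:
        models = pool.map(train_on_chunk, chunks)  # one model per processor
    print(majority_vote(models, X[0]))             # combined prediction
```

Because each chunk is trained independently, the work scales across processors; only the cheap voting step is sequential.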
4.0 Conclusion
Big data greatly affects competition among business organizations: the manner in which organizations use big data determines the degree to which they outshine competitors in similar markets. Big data has brought great benefits to commercial organizations, since it allows firms to analyze and predict consumer behavior. Consumers, for example, are more likely to develop loyalty to health care organizations that make optimal use of big data to improve patient outcomes. It is very important for actors in any industry to develop strategic responses to the impacts of big data on the nature of competition; one such response is using big data to improve the organization's decision-making processes.
Current algorithms and techniques are effective in mining and processing huge data, and they are being used in the modern distributed environment with considerable success. However, the processes that generate the huge data are not static; changes in these processes and in the environment call for changes in the algorithms and techniques used for data mining and analysis. Future research should therefore focus on developing effective algorithms and techniques in response to the changes experienced in data mining and analysis. Research on the use of big data should proceed at the same pace as research on the means of mining and analyzing it.
Works Cited
Abdrabo, Mai, et al. “Enhancing Big Data Value Using Knowledge Discovery Techniques.” 2016.
Bolin, Göran, and Jonas Andersson Schwarz. “Heuristics of the Algorithm: Big Data, User Interpretation and Institutional Translation.” Big Data & Society, vol. 2, no. 2, 2015.
Diaconita, Vlad. “Processing Unstructured Documents and Social Media Using Big Data Techniques.” Economic Research-Ekonomska Istraživanja, vol. 28, no. 1, 2015, pp. 981-993.
Forbes Insights and Rocket Fuel. The Big Potential of Big Data: A Field Guide for CMOs. Forbes, 24 Oct. 2013.
González, Roberto J. “Seeing into Hearts and Minds: Part 2. ‘Big Data’, Algorithms, and Computational Counterinsurgency.” Anthropology Today, vol. 31, no. 4, 2015, pp. 13-18.
Hashmi, Adeel Shiraz, and Tanvir Ahmad. “Big Data Mining Techniques.” Indian Journal of Science and Technology, vol. 9, no. 37, 2016.
Institute for Health Technology Transformation. Transforming Health Care through Big Data: Strategies for Leveraging Big Data in the Health Care Industry. n.d., http://assets.fiercemarkets.com/public/newsletter/fiercehealthit/iht2bigdata.pdf
Janssen, Marijn, and George Kuk. “The Challenges and Limits of Big Data Algorithms in Technocratic Governance.” Government Information Quarterly, vol. 33, no. 3, 2016, pp. 371-377.
Manyika, James, et al. “Big Data: The Next Frontier for Innovation, Competition, and Productivity.” McKinsey Global Institute, 2011.
Radhika, D., and D. Aruna Kumari. “A Framework for Exploring Algorithms for Big Data Mining.” Indian Journal of Science and Technology, vol. 9, no. 17, 2016.
Ranjan, Rajiv, et al. “Advances in Methods and Techniques for Processing Streaming Big Data in Datacentre Clouds.” IEEE Transactions on Emerging Topics in Computing, vol. 4, no. 2, 2016, pp. 262-265.
Yin, Chunyong, et al. “An Improved Anonymity Model for Big Data Security Based on Clustering Algorithm.” Concurrency and Computation: Practice and Experience, 2016.