Data is growing at a quicker rate than ever before. By 2020, every individual online will produce roughly 1.7 megabytes of new data every second of each day. On Google alone, users perform more than 40,000 search queries every second. Because every second, even every millisecond, of delay results in mountains of lost data, each business needs a dedicated platform to capture and analyze data at these progressively rapid speeds. Tools that were once sufficient for big data applications no longer are. When batch operations predominated, Hadoop could handle most of an organization’s needs. Since then, developments in other IT areas have changed the way data is collected, stored, distributed, processed, and analyzed. The need for real-time decisions has complicated this scenario, and new tools and architectures are required to meet these challenges efficiently. Given this consistently increasing amount of available data, and the fact that most of it is live in the moment, the benefits of big data will be lost if the information isn’t processed quickly enough. Here is where the concept of “fast data” steps up to the plate.
What is “FAST DATA”?
Fast data is real-time data that typically comes from streaming sources and is analyzed quickly to make rapid business decisions.
When considering the three V’s of data, volume, velocity, and variety: for a while big data emphasized volume, whereas fast data applications emphasize velocity and variety. Two tendencies have emerged from this evolution. First, the variety and velocity of the data that enterprises need for decision making continue to grow. This data includes not only transactional information, but also business data, IoT metrics, operational information, and application logs. Second, modern enterprises need to make those decisions in real time, based on all the collected data. This need is best illustrated by how modern shopping websites work. Websites are about what we can offer a particular client at a precise moment. We need to know the “heat” zones of the page (the most clicked items) and why there are cold zones. If an individual is viewing segment ‘A’ of the page, the system must offer segment ‘B’, since previous data has shown that numerous clients moved from segment A to segment B. The challenges are, first, to collect all the data over the internet, clean it, process it, and analyze it; and second, to apply the resulting changes to the web page immediately, so the analysis must be accurate and tied dynamically to the customer at a precise moment.
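The segment recommendation described above can be sketched as a simple transition count over past navigation paths. This is an illustrative toy, not a production recommender; the session data and function names are hypothetical.

```python
from collections import Counter

# Hypothetical click-stream data: each inner list is one past user's
# navigation path through page segments (illustrative values only).
sessions = [
    ["A", "B", "C"],
    ["A", "B"],
    ["A", "D"],
    ["B", "C"],
]

def next_segment(current, sessions):
    """Recommend the segment most often visited right after `current`."""
    transitions = Counter()
    for path in sessions:
        for here, there in zip(path, path[1:]):
            if here == current:
                transitions[there] += 1
    return transitions.most_common(1)[0][0] if transitions else None

# Past data shows most users moved from segment A to segment B,
# so a viewer of segment A is offered segment B.
print(next_segment("A", sessions))  # -> B
```

In a real fast data system the counts would be updated continuously as events stream in, rather than recomputed over a stored list.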
Designing Architectures For FAST DATA
Big data used to mean Hadoop and NoSQL databases. Hadoop was conceived in the “batch mode” and “off-line” processing era, when data was captured, stored, and processed periodically with batch jobs. In those years, search engines worked by having data collected by web crawlers and then processed overnight to offer updated results the next day. Big data was centered on data capture and off-line batch-mode operation. As the website example above shows, modern “fast data” reduces the time between data arriving and value being extracted from it. Real-time event processing can be seen as the opposite of batch processing: in real-time fast data architectures, individual events are processed as they arrive, with processing times of milliseconds or even microseconds. Building fast data architectures that can do this sort of millisecond processing means using systems and approaches that deliver timely, cost-efficient data processing while keeping developers productive. It is also important that the architecture components comply with the three R’s: Reactive (scaling up and down based on demand), Resilient (against failures in all the distributed systems), and Responsive (even when failures limit the ability to deliver services). Related to the R’s, fast data processing has also changed data center operations, so data center management has become a key part of a fast data architecture that meets the demand of real-time decision making. Modern fast data systems are composed of four transformation stages that provide the following data capabilities:
- Data acquisition
- Data storage
- Processing and analysis
- Presentation and visualization
Data acquisition
In this step, data enters the system from diverse sources. The key focus of this stage is performance, as this step determines how much data the whole system can receive at any given point in time.
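The entry point described above can be sketched with a bounded in-memory buffer that decouples fast producers from the rest of the pipeline. This is a minimal illustration of the idea, not a real broker client; the function names and sizes are assumptions.

```python
import queue

# A bounded buffer absorbs bursts and applies backpressure:
# put() blocks when the buffer is full, protecting the system.
buffer = queue.Queue(maxsize=10_000)

def ingest(event):
    """Fast, processing-free entry path: just enqueue the event."""
    buffer.put(event)

def drain(batch_size=100):
    """Hand events downstream in batches to amortize per-event overhead."""
    batch = []
    while not buffer.empty() and len(batch) < batch_size:
        batch.append(buffer.get())
    return batch

for i in range(250):
    ingest({"id": i})
print(len(drain()))  # -> 100
```

Real ingestion layers (for example a message broker) follow the same principle: keep the receiving path cheap and defer all processing to later stages.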
Data storage
When recording information in a storage medium, two perspectives are considered: logical data storage (i.e., the model) and physical data storage. The key focus of this stage is experimentation and flexibility.
Processing and analysis
Quite a while back, there was debate about whether big data systems should use (modern) stream processing or (traditional) batch processing. The right answer for fast data is that processing must be hybrid: both batch and stream at the same time. The type of processing is now defined by the process itself, not by the tool. The key focus of this stage is combination: fast data tools must offer excellent batch processing, excellent stream processing, or both at the same time to be competitive. The point is to use the right tool for the right processing task; it is not recommended to use batch tools when online processing is needed. Some tools divide data into smaller pieces and process every chunk independently in individual tasks. Some frameworks turn each data chunk into a small batch, which is called micro-batching, while another model processes the data as a single continuous stream.
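The three processing models above can be contrasted on the same trivial computation (a running total). This sketch only illustrates the shape of each model; real frameworks distribute this work across a cluster.

```python
# Same computation (sum of values) under three processing models,
# illustrating that the model, not the tool, defines the processing.
events = [1, 2, 3, 4, 5, 6, 7]

# Batch: process the whole dataset at once, after it has been collected.
batch_total = sum(events)

# Micro-batch: split the incoming stream into small chunks and
# process each chunk as its own tiny batch.
chunk = 3
micro_totals = [sum(events[i:i + chunk]) for i in range(0, len(events), chunk)]

# Stream: process one event at a time, updating state as each arrives.
stream_total = 0
for e in events:
    stream_total += e  # state updated per individual event

print(batch_total, micro_totals, stream_total)  # -> 28 [6, 15, 7] 28
```

All three arrive at the same answer; they differ in latency (stream reacts per event, batch only after all data is in) and in per-event overhead (micro-batching amortizes it across a chunk).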
Presentation and visualization
Visualization communicates data or information by encoding it as visual objects such as graphs, to display information to users clearly and efficiently. This step should avoid heavy processing. Although some reports need runtime calculations, data should always be available in summarized output tables, with groupings that can be temporal (by time period) or categorical (based on business classes). Performance is better if reports are parallelized, and the use of caching always has a positive impact on the performance of the visualization layer.
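The summarized-table-plus-caching idea above can be sketched as a pre-aggregation keyed by time period. The event values and hour buckets here are made up for illustration.

```python
from collections import defaultdict
from functools import lru_cache

# Illustrative raw events (hypothetical values): the visualization layer
# should never scan these directly at report time.
raw_events = [
    {"hour": "09:00", "clicks": 3},
    {"hour": "09:00", "clicks": 2},
    {"hour": "10:00", "clicks": 7},
]

# Temporal grouping: roll raw events up into a summary table by hour.
summary = defaultdict(int)
for e in raw_events:
    summary[e["hour"]] += e["clicks"]

@lru_cache(maxsize=128)  # caching: repeated report lookups are free
def report(hour):
    """Read the pre-computed summary instead of recomputing over raw data."""
    return summary.get(hour, 0)

print(report("09:00"), report("10:00"))  # -> 5 7
```

The categorical case is identical except the summary key would be a business class (e.g., a product category) instead of an hour.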
Data center management
Even after designing a solid architecture to handle the four data transformation stages, it is important to recognize that modern fast data processing has also changed data center operations. Business is moving from specialized, proprietary, and expensive supercomputers to deployments on clusters of commodity machines connected by a low-cost network. Now, the Total Cost of Ownership (TCO) determines the destiny, quality, and size of a data center. One common practice is to create a dedicated cluster for each technology, for example a Kafka cluster, a Spark cluster, and a Cassandra cluster, although this tends to increase the overall TCO. The tendency is to adopt open source and avoid two dependencies: vendor lock-in and reliance on external entity support. Transparency is ensured through community-led foundations (e.g., the Apache Software Foundation and the Eclipse Foundation), which provide guidelines, infrastructure, and tooling for sustainable and fair technology development. Managing and using data effectively, moving it through the stages of acquisition, storage, processing and analysis, and visualization, will yield actionable insights and unlock the real value in data.
Exposition Magazine Issue 15
Department of Industrial Management
University of Kelaniya