The collection of data about users of various websites has been around since it became technically possible -- that is, since the 1990s, when the first database-backed services and supporting mechanisms such as browser cookies appeared. On the other hand, it has only recently become widespread enough for regulations to emerge that spell out what can and cannot be done (the GDPR being the obvious example). What's more, the awareness of ordinary Internet users has also begun to grow, which in turn drives mechanisms such as privacy settings. All of these, however, are topics that primarily concern the people directly responsible for running the business and adapting it to existing requirements, both legal and practical. But how does a programmer find his way in such a world? Does the exponentially growing amount of information exchanged between client and server force, or at least encourage, the right set of skills? Finally -- if we simmer it all in a thick, cloud-based sauce, will we end up with a palatable dish, or will we have to throw it all in the trash?
Personally, I'm a bit frightened by the challenges faced by today's systems that more or less fit the term Big Data. It's very easy to oversimplify the subject by saying that the only thing we have to deal with is the volume of data we will store. It quickly becomes apparent that the database is the least of our problems -- storage space is extremely cheap, and holding terabytes of data costs a few tens of dollars a month. For example, storing 1 TB of data in Azure Table Storage will cost us about $60. The bigger problem turns out to be accessing the stored information. On a small scale, reading everything we have is both doable and easy to estimate in terms of the time required: it will take a while, but at the end of the day we will get a result more or less as expected. Add the Big Data factor, and suddenly mistakes at the level of data organization or structure are costly not only for our wallet but for the entire cloud solution. If for some reason we need to read all the records we have collected, we may fall into the trap of not finishing the process before, say, the next iteration of daily reports, which would then have to operate on two sets of data, further destabilizing the architecture. This is not a scenario we would like to find ourselves in.
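The scale trap above is easy to see with a back-of-the-envelope calculation. A minimal sketch, where all throughput and dataset figures are illustrative assumptions rather than numbers from any real system:

```python
# Back-of-the-envelope estimate: how long does a full scan take
# as the dataset grows? All figures below are illustrative assumptions.

def full_scan_hours(dataset_gb: float, read_mb_per_s: float) -> float:
    """Hours needed to read the whole dataset at a sustained rate."""
    seconds = dataset_gb * 1024 / read_mb_per_s
    return seconds / 3600

# A 50 GB dataset read at 10 MB/s: comfortably within a workday.
print(f"{full_scan_hours(50, 10):.1f} h")         # ~1.4 h

# A 10 TB dataset at the same rate: far longer than a daily
# reporting window -- the next report starts before this one ends.
print(f"{full_scan_hours(10 * 1024, 10):.1f} h")  # ~291 h
```

The point is not the exact numbers but the shape of the curve: a process that was "doable and estimable" quietly crosses the boundary of its reporting window as data accumulates.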
Emerging regulations are unlikely to improve the situation. In one project my team and I are currently working on, we have been forced to significantly complicate the existing infrastructure by a theoretically trivial requirement that arose quite recently -- the need to return to users all the information collected about them, if they ask us to do so. The problem seems simple to solve: just associate the user's ID with all the collected information and then write a simple query that pulls it out. This would be true if we could use, say, a relational SQL database, but data collection at Big Data scale effectively kills such a solution. Taking it a step further, we could produce views from the collected data (an idea that is used successfully to handle predetermined queries), but this is only half the battle. It turns out that in addition to requesting their information, a user can also ask us to delete everything we've learned about them. Admittedly, the process differs slightly between anonymizing data and deleting everything that counts as personal data, but it complicates our solution even more.
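One way to avoid a full scan for such requests is to maintain a secondary index from user ID to the locations of that user's records, updated on every write. The sketch below is a hypothetical in-memory illustration of the idea -- the class, method names, and key format are all invented for this example, not taken from any real system:

```python
# Sketch: a per-user index maintained alongside the main event store,
# so "return my data" and "delete my data" requests don't require
# scanning everything. All names here are hypothetical.
from collections import defaultdict

class UserIndex:
    def __init__(self):
        # user_id -> list of record locations in the main store
        self._locations = defaultdict(list)

    def track(self, user_id: str, record_key: str) -> None:
        """Called on every write, so later lookups avoid a full scan."""
        self._locations[user_id].append(record_key)

    def records_for(self, user_id: str) -> list:
        """Serve a 'return all my data' request."""
        return list(self._locations[user_id])

    def forget(self, user_id: str) -> list:
        """Serve a deletion request: hand back the keys to erase
        from the main store and drop the index entry itself."""
        return self._locations.pop(user_id, [])

idx = UserIndex()
idx.track("user-42", "partition-7/blob-123")
idx.track("user-42", "partition-3/blob-456")
print(idx.records_for("user-42"))  # both locations, no scan needed
print(idx.forget("user-42"))       # keys to erase; entry removed
```

In a real deployment the index itself becomes data to operate at scale (and to keep consistent with the store), which is exactly where the "theoretically trivial" requirement starts complicating the infrastructure.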
All of the problems mentioned above force the programmer to learn new skills in a fairly short time. It is also extremely helpful to have the support of an experienced data scientist, whose intuition can simplify the preparation of data structures that will be easy to work with later. This seems to be the starting point when we begin preparing our Big Data platform -- familiarity with the cloud and its characteristics obviously helps a lot, but no matter how fast individual cloud services may be, a serious mistake at the beginning can expose us to unnecessary costs and stress later on. What skills are we talking about, though? I think knowledge in two areas turns out to be important: the architecture of distributed systems (where we often aggregate information from multiple sources that don't necessarily share a common context or schema) and the techniques of communication between the different elements of our system. While the first point is often handled for us (for example, the Event Hub in Microsoft Azure, which implements the "partitioned consumers" pattern), the second requires reading a few articles, testing, and gaining experience in order to choose the right solution. Sometimes a given element can be sped up by several orders of magnitude (measured, say, in processed events per second), but it's hard to achieve this without knowing the right optimization techniques, such as proper serialization and data partitioning.
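The "partitioned consumers" pattern mentioned above can be illustrated with a few lines of code. This is only a conceptual sketch -- a real Event Hub client computes the partition from the partition key for you -- but it shows the core idea: a stable hash of a key spreads events across partitions, while events sharing a key always land on the same partition and so keep their relative order for that partition's consumer:

```python
# Conceptual sketch of partition-key routing: events are spread
# across partitions by a stable hash of their key, so each consumer
# reads one partition independently while events from the same
# source stay ordered. Illustrative only.
import hashlib

PARTITION_COUNT = 4

def partition_for(key: str) -> int:
    """Map a partition key to a partition number via a stable hash."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % PARTITION_COUNT

events = [("sensor-a", 1), ("sensor-b", 2), ("sensor-a", 3)]
for source, payload in events:
    print(source, "-> partition", partition_for(source))
# Both "sensor-a" events land on the same partition, so their
# relative order is preserved for whichever consumer reads it.
```

The choice of partition key is exactly the kind of early decision that is cheap to make and expensive to change: a skewed key (one very hot source) leaves most partitions idle while one is overloaded.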
The cloud allows us to play around with our architecture and quickly swap out its components if the need arises. In all this, however, the problem is sometimes not the performance of a single component, but a characteristic you don't think about at first -- things like the bandwidth of the network connection, or the time it takes to read or write data from disk. Imagine we are trying to read all the data from our storage for processing, and we have stored it in a way that allows parallel processing. All our efforts may be in vain if, despite being able to process 100 MB of data per second, the storage can only read and send 10 MB. At that point we have two choices: either change the storage to one that matches our computing power, or store the data in such a way that we use 100% of the bandwidth it offers.
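The mismatch described above follows directly from the fact that a pipeline runs at the speed of its slowest stage. A minimal sketch, using the illustrative 10 MB/s and 100 MB/s figures from the paragraph:

```python
# The pipeline runs at the speed of its slowest stage; the figures
# below are the illustrative numbers from the text.
import math

def effective_mb_per_s(storage_mb_per_s: float, compute_mb_per_s: float) -> float:
    """End-to-end throughput is bounded by the slower of the two stages."""
    return min(storage_mb_per_s, compute_mb_per_s)

def readers_needed(per_stream_mb_per_s: float, target_mb_per_s: float) -> int:
    """Parallel streams needed to saturate the compute stage."""
    return math.ceil(target_mb_per_s / per_stream_mb_per_s)

# One 10 MB/s storage stream feeding a 100 MB/s processor:
print(effective_mb_per_s(10, 100))  # 10 -- 90% of the compute sits idle
# Partitioning the data across parallel streams closes the gap:
print(readers_needed(10, 100))      # 10 parallel readers
```

This is the quantitative version of the second choice from the paragraph: storing the data so that enough streams can be read in parallel to saturate the compute stage.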
Despite all these adventures, however, the cloud is our friend when it comes to building Big Data solutions. Admittedly, the capabilities on offer vary from provider to provider, but taken together they form a sizable array of services covering most, if not all, scenarios. We get components responsible for communication (Kinesis, Event Hub), stream processing (again Kinesis, Stream Analytics), and data storage (S3, Azure Storage). If we add the various products from the "serverless" basket, it turns out that we can not only build our platform any way we want, but also amortize the costs. A huge advantage of the cloud is that its costs can be tied directly to the application lifecycle. From the programmer's (or architect's) point of view this is an added bonus -- if I don't have to worry about the future of my infrastructure, my estimates don't have to be so generous; if necessary, I will replace or modify the appropriate component to meet the new requirements.
But how do I find my way around this horn of plenty? Doesn't the multitude of both commercial and free solutions work against the cloud? Personally, I think it's quite the opposite -- if I am free to choose the components of the architecture, I can match them to both the team and the current workload. From a technical point of view it's a bit of a challenge -- you have to make sure the data flows correctly at every stage -- but from the perspective of the platform as a whole, it's a rather desirable situation. It is no longer a monolith that, once deployed to production, causes embarrassment every time something needs to be changed. The modularity and specialization of each service mostly help rather than hinder.
What excites me most about the combination of cloud and Big Data? The challenges. The problems of traditional systems can keep you awake at night too, but their scale is much smaller. In the world of data you have to stick to some rigid rules about how it must be stored and what may be done with it, yet at the same time the very core of the solution is usually undefined. In fact, from a programmer's point of view (or, more generally, looking at Big Data from the purely technical side), this is one of the cooler areas to be in. It's no longer your typical desktop application or yet another reheated web service. You sit in your chair, a few hundred MB of data arrives, and your code has to deal with it as quickly as possible. No more than a second.