A Data Lake is nothing more than the name implies: a repository responsible for storing a huge amount of the most varied kinds of data.
Imagine a Data Lake as your collection of games from many different genres and platforms. You can organize them any way you like. Do you want to leave everything lying around? You can. Want to sort by release date? You can. Want to organize by box art colors? You can. Want to organize by genre and/or platform? You can. In this example you don’t quite know the final scope of your collection (after all, the idea is that you keep buying, and even selling, games), but having a space to store all your titles in one place is a huge help when that urge to play something hits.
What was described above can be equated to a Data Lake, with the difference that, instead of games, it stores the most diverse data and information, ready to be accessed whenever you need it during a project.
For Business Intelligence, especially for those starting in the area, maintaining a Data Lake is essential, as it is simpler than a Data Warehouse and more comprehensive than a Data Mart.
In the current data ecosystem, a lot of care goes into preparing data, whether through Artificial Intelligence or amazing B.I. tools, in order to build extremely sophisticated data storage centers. The Data Lake proposal goes against this notion: it demands less of the hardware and less time preparing data for hands-on work, leaving more time for analyzing the data stored in the Data Lake.
But now let’s get down to business: what exactly should we store in a Data Lake?
Well, all the data relevant to your company and to your future projects! Keeping it organized and fed with relevant data is a joint effort. And yes, the criticism that must be running through your head right now (“but then does it mean that we’ll have stored data of different qualities and no polish at all?”) is valid. In fact it does, and it is the data scientist’s job to identify which data is relevant to each project and to analyze it carefully before handling it.
“But wow, it must be a mess to have so much different data in one place, right???” Calm down, young padawan. The question is pertinent, but did you think becoming B.I.’s Pokémon master would be easy? We always have to be prepared!
One tip I offer is to make heavy use of metadata. You don’t have to be a B.I. master to keep a Data Lake tidy, but you do have to be careful about tagging. Use tags for everything you store.
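To make the tagging tip concrete, here is a minimal sketch of a tag catalog in Python. Everything in it is an assumption made for illustration: the `CatalogEntry` and `Catalog` names, the file paths and the tags are all hypothetical, and a real Data Lake would normally rely on a dedicated catalog tool rather than an in-memory list.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CatalogEntry:
    """Metadata describing one raw file stored in the lake (hypothetical shape)."""
    path: str
    source: str
    ingested_on: date
    tags: set[str] = field(default_factory=set)


class Catalog:
    """A tiny in-memory index of catalog entries, searchable by tag."""

    def __init__(self) -> None:
        self._entries: list[CatalogEntry] = []

    def register(self, entry: CatalogEntry) -> None:
        # Enforce the rule from the text: nothing goes in without tags.
        if not entry.tags:
            raise ValueError(f"refusing to register untagged file: {entry.path}")
        self._entries.append(entry)

    def find(self, *tags: str) -> list[CatalogEntry]:
        """Return the entries that carry every requested tag."""
        wanted = set(tags)
        return [e for e in self._entries if wanted <= e.tags]


# Invented example entries.
catalog = Catalog()
catalog.register(CatalogEntry(
    "raw/prices_2023.csv", "web-scrape", date(2023, 5, 2), {"pricing", "genre"}))
catalog.register(CatalogEntry(
    "raw/soundtracks.json", "survey", date(2023, 6, 9), {"music", "genre"}))

matches = [e.path for e in catalog.find("genre", "pricing")]
print(matches)
```

The point of refusing untagged files at registration time is that tidiness is enforced at the door instead of being cleaned up later.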
Another piece of wisdom I leave with you is the following: ALWAYS be careful and organized. Working with B.I. is not just about making wonderful reports, graphs and spreadsheets. Your responsibility is HUGE, since the information and recommendations that result from your work will guide people’s decision-making. A careless mistake, or something you saved in your Data Lake that ended up lost, can damage the client and, worse, your credibility as a B.I. professional! So be VERY careful when creating your Data Lake. Take good care of it and it will take good care of you in the future.
“Oh, Quinho, so I just throw any data in there, keep it organized, and it works!” Wait a moment. Building a Data Lake is serious business. It’s not enough to create a repository and feed it with whatever comes into your head. Data Lakes must follow rigorous processes to ensure the necessary data security, so that you don’t end up using inaccurate data in your work. So be very careful about what you include.
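One possible shape for such a process is a small gate that every file passes through before it enters the lake. The sketch below is only an illustration under assumptions: the `admit` and `is_unchanged` functions, the list of trusted sources and the sample payload are all hypothetical, and a real project would add many more checks than these.

```python
import hashlib


def admit(payload: bytes, source: str, trusted_sources: set[str]) -> dict:
    """Check a raw payload before it enters the lake and return its ingestion record."""
    if source not in trusted_sources:
        raise ValueError(f"untrusted source: {source!r}")
    if not payload:
        raise ValueError("empty payload")
    return {
        "source": source,
        "size_bytes": len(payload),
        # The checksum lets us detect later whether the stored file was altered.
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


def is_unchanged(payload: bytes, record: dict) -> bool:
    """Re-verify a stored payload against the checksum taken at ingestion."""
    return hashlib.sha256(payload).hexdigest() == record["sha256"]


# Invented example values.
TRUSTED = {"sales-export", "store-api"}
payload = b"genre,revenue\nrpg,120\n"
record = admit(payload, "sales-export", TRUSTED)
print(record["source"], is_unchanged(payload, record))
```

Recording where each file came from and what it looked like on arrival does not make the data true, but it does let you trace a suspicious number back to its origin.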
Despite being wonderful, the Data Lake is still hyped well beyond what it actually delivers. Remember that it is not the ultimate solution to your data storage problems and that it takes a LOT of effort to work as expected. Take it easy, don’t bite off more than you can chew, and respect your own pace. That way, who knows, you can get into this lake without drowning.
In direct comparison to a Data Warehouse, for example, Data Lakes have some clear advantages: the volume of data they can accumulate, since the data is stored raw rather than treated as in a DW; lower investment; and, of course, flexibility, since the Data Lake does not use a pre-established model, as the Data Warehouse does.
A classic example that clarifies the difference between a Data Warehouse and a Data Lake is the question of trends in Game Design:
What genre will your game be? What are the genres that have generated the most revenue in the last 10 years?
To answer this question by consulting a Data Lake, you’ll find absurd amounts of raw data on everything related to game genres, from what was trending 10 years ago to game pricing by genre or benchmarks of major competitors by genre (in terms of performance, game duration, graphic style, among others). With the help of software for analyzing this data (such as Hadoop and OpenRefine), it is possible to extract the answer to your question and also pick up tangential data that will probably be relevant later in your project, such as which style of music has synergy with each genre. With a Data Warehouse, parameters are set up in advance so that the answer comes from data that was fed in more judiciously by the software storing it. Let’s say that a Data Warehouse is more expensive, more rigid and capable of providing more complex and deeper answers than a Data Lake.
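The Data Lake side of this comparison can be sketched in a few lines of Python: the records are stored raw, with whatever fields they happen to have, and structure is only applied at the moment the question is asked. The records, field names and the `revenue_by_genre` function are all invented for illustration, and at real scale this work would fall to tools like the ones mentioned above rather than a plain loop.

```python
import json
from collections import defaultdict

# Invented raw records, stored as-is; note that the fields differ between them.
RAW_LINES = [
    '{"title": "A", "genre": "RPG", "revenue_usd": 120.0, "year": 2016}',
    '{"title": "B", "genre": "Puzzle", "revenue_usd": 40.0, "soundtrack": "chiptune"}',
    '{"title": "C", "genre": "RPG", "revenue_usd": 80.0, "year": 2021}',
    '{"title": "D", "platform": "PC"}',
]


def revenue_by_genre(lines):
    """Apply structure at read time: use only records that carry both fields."""
    totals = defaultdict(float)
    for line in lines:
        record = json.loads(line)
        if "genre" in record and "revenue_usd" in record:
            totals[record["genre"]] += record["revenue_usd"]
    # Highest-earning genres first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = revenue_by_genre(RAW_LINES)
print(ranking)
```

Record D has no genre or revenue, so it is skipped for this question but stays in the lake, and the stray soundtrack field on record B is exactly the kind of tangential data that may matter later. A Data Warehouse would instead reject or reshape such records on the way in, to fit its pre-established model.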
In short, a Data Lake is essential in any BI project, as it’s a repository for everything you’ll glean through data scraping and other data harvesting techniques. It is essential in the sense that it facilitates all your present and future research, but it takes a lot of work to maintain, because the Data Lake alone will not achieve the goal of supporting that research.
Want to learn more about Business Intelligence, market analysis, game-oriented marketing and more? Follow the #GamePlanCompass here and contact us! GamePlan is always open to pointing out the most diverse paths for the most different needs!
Image by Wilfried Pohnke from Pixabay