How to Empty a Datalake?

At a time when lakes are drying up and the climate emergency calls for a reevaluation of our consumption habits, we regularly clear our email inboxes yet keep filling servers with personal and professional data, whether temporary or permanent, valuable or transient, warm or cold. Given the resources and energy this storage consumes and the growing importance of digital sobriety, how can we effectively empty a Datalake, clean it up, and reduce its financial and environmental impacts?

Half of the world’s lakes and reservoirs, which play a crucial role in carbon storage, are shrinking due to rising temperatures, human activity, and reduced precipitation: 53% of the planet’s largest lakes saw a significant decline in water levels between 1992 and 2020 [1].

Simultaneously, with the proliferation of connected devices, cloud computing, and more, the volume of digital data created or replicated globally continues to soar at a breathtaking rate, multiplying by thirty between 2010 and 2020 and projected to grow at an annual rate of around +40% until 2025 [2]. “Dark data,” or unused, unknown, and unexploited data generated by users’ daily interactions, accounts for 52% of the world’s stored data [3].

Data lakes (Datalakes), which support data governance strategies within companies, keep accumulating data. While they address the need to leverage rapidly expanding data volumes economically, Datalakes are energy-intensive, and the primary challenge is the storage of unnecessary or obsolete data. The volume of data that businesses manage doubles every two years [4]. On average, between 60% and 73% of a company’s data is never used for analysis [5]. Business-generated cold data amounts to 1.3 billion gigabytes, equivalent to 1.3 billion high-definition DVDs! [6]

In the face of the environmental crisis, Net Zero commitments are no longer sufficient; we must identify and activate every lever that reduces our environmental footprint as soon as possible. But in businesses, as in personal settings, we understand why we must act yet often don’t know how or where to start. The “we keep it because you never know!” attitude not only harms the environment but also weighs heavily on companies’ finances. For example, research firm IDC estimates that “dark data” costs global businesses €2 billion each month.

How can we quantify the volume of dormant data in comparison to usable and utilized data? What data unnecessarily clogs up storage space, consumes energy in vain, and drastically increases costs?

One of the first steps is to address data management within IT:

  • Governance: Who is responsible for the data, who has the right to add or remove data, what is the distribution of responsibilities and associated rights? How can users be engaged at every link in the chain (training and awareness issues)? The percentage of large global companies with a Chief Data Officer (CDO) reached 27% in 2022, up from 21% the previous year. This role is particularly common in Europe, where over 40% of large European companies have appointed a CDO to manage data [7].
  • Skills: Ingesting less data involves optimizing resources while considering data usage context. Key roles in these efforts include Data Architects, Data Engineers, Data Scientists, and Data Platform Engineers, with a strong emphasis on raising awareness among all stakeholders for a clear understanding of the issues.
  • Corporate culture: How are requirements qualified between business and IT? How are projects handled? Is there a culture of economy and sobriety? How do the company’s commitments reach the operational level? Is controlling environmental costs a sufficient lever for cleaning up, or is a purely financial approach preferable, even if it means counting environmental gains as a “bonus”?
  • Storage methods: They must take into account business use cases (updates, access, etc.) as well as regulatory (GDPR) and financial (cost limits) constraints. Cold storage can be a lever to reduce impacts and costs while giving the business access to data for longer. Deletion is not a habit; we have a collector’s reflex, like children with a bag of marbles! The cost difference between hot and archival storage varies significantly between storage providers (by a factor of 1 to 20).
  • Technologies: Unequal in efficiency, they can also encourage consumption. Projects are initiated, migrations are made, but do we know how to decommission? What is the impact of a data platform? What are the technical characteristics that enable optimization?
  • Continuous improvement through monitoring: We can track the absolute volume of storage and its growth, which raises the question of whether data storage growth is correlated with, or decoupled from, added value for the business. We can also track the ratio between stored and used data.
  • Technical debt as a strong constraint: Technical debt is estimated to represent between 10% and 20% of new project expenses [8]. Data accumulates over time, and tidying up means taking care of the “legacy.” It is entirely possible to integrate this constraint into the daily management of a datalake.
  • Costs: For most companies, this cold data was neglected – its storage cost seemed “reasonable.” But with soaring electricity prices, storage costs are rising, and this factor can no longer be underestimated or ignored. The balance now includes financial costs, environmental costs, and the ever-increasing cost of labor.
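
The monitoring indicators listed above (absolute volume, growth, and the ratio between stored and used data) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Dataset` class, the 180-day dormancy threshold, and the sample figures are all assumptions made for the example, not values taken from the sources.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Dataset:
    name: str          # hypothetical dataset identifier
    size_gb: float     # storage footprint in gigabytes
    last_access: date  # last time the data was read


def dormant_share(datasets, today, dormant_after_days=180):
    """Return the fraction of stored volume not read within the threshold."""
    total = sum(d.size_gb for d in datasets)
    if total == 0:
        return 0.0
    cutoff = today - timedelta(days=dormant_after_days)
    dormant = sum(d.size_gb for d in datasets if d.last_access < cutoff)
    return dormant / total


def growth_rate(previous_total_gb, current_total_gb):
    """Relative growth of stored volume between two measurements."""
    if previous_total_gb <= 0:
        raise ValueError("previous_total_gb must be positive")
    return (current_total_gb - previous_total_gb) / previous_total_gb


# Made-up inventory, for illustration only.
inventory = [
    Dataset("sales_raw", 400.0, date(2023, 9, 1)),
    Dataset("tmp_export_2019", 500.0, date(2019, 12, 15)),
    Dataset("clickstream", 100.0, date(2022, 1, 10)),
]
today = date(2023, 10, 1)
share = dormant_share(inventory, today)
print(f"Dormant share: {share:.0%}")
print(f"Growth: {growth_rate(800.0, 1000.0):.0%}")
```

Tracked over time, these two numbers show whether storage growth is driven by data that is actually used or by data that merely accumulates.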

Once we have identified all the parameters that should be taken into account for responsible data management, we must add a vision of what data management should evolve into:

Absolute value decline: The growth of digital impacts (carbon emissions from digital technology in France could triple by 2050, source: Arcep 2023) cannot and should not be “infinite.” One of tomorrow’s challenges will be to provide digital services with an environmental impact that grows more slowly than the proposed uses. This may involve technological developments and a selection of use cases based on their potential environmental, social, or societal impacts.

Stable operating model: It is essential to define processes at the datalake’s entry, from data qualification to ingestion in the correct format for use (sorting between cold data that should remain cold, usable cold data, and cold data to be permanently deleted), providing a framework for suppliers and consumers, and establishing a storage technology watch.

Data lifecycle phases: 

  • Ingestion: once data is in, it stays, so the ingestion mode is a lever to limit growth.
  • Data cleaning: cleaning temporary tables and defining a lifespan for each piece of data, after which it is automatically deleted.
  • Data exposure (also known as datamesh): allows data to be exposed to other teams and fully leveraged, with optimized storage in one place.
  • Data cleaning: How can automatic cleaning processes be implemented, with suitable monitoring and alerting to track them?
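
The cleaning phase, in which each piece of data receives a lifespan after which it is automatically deleted, can be sketched as follows. This is an illustration under stated assumptions: the `RETENTION_DAYS` table, the category names, and the `plan_cleanup` helper are hypothetical, and the function only returns a deletion plan (a dry run) instead of deleting anything.

```python
from datetime import date, timedelta

# Hypothetical retention periods per data category, in days.
# None means "keep indefinitely" (for example, a legal archive).
RETENTION_DAYS = {
    "temporary": 7,
    "analytics": 365,
    "legal_archive": None,
}


def plan_cleanup(tables, today):
    """Split tables into (to_delete, to_keep) based on their retention period.

    `tables` is a list of (name, category, created_on) tuples.
    Unknown categories are kept, so nothing is deleted by accident.
    """
    to_delete, to_keep = [], []
    for name, category, created_on in tables:
        days = RETENTION_DAYS.get(category)
        expired = days is not None and created_on + timedelta(days=days) < today
        (to_delete if expired else to_keep).append(name)
    return to_delete, to_keep


# Made-up table inventory, for illustration only.
tables = [
    ("tmp_join_42", "temporary", date(2023, 9, 1)),
    ("web_sessions_2021", "analytics", date(2021, 6, 30)),
    ("contracts", "legal_archive", date(2015, 1, 1)),
    ("sales_2023", "analytics", date(2023, 8, 15)),
]
to_delete, to_keep = plan_cleanup(tables, today=date(2023, 10, 1))
print("Delete:", to_delete)
print("Keep:", to_keep)
```

Producing a plan first, and deleting only after review or after an alert window has passed, is one way to pair automatic cleaning with the monitoring the text calls for.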

CIOs have an increasingly significant role to play in achieving a company’s environmental goals, supporting the industrialization of CSR approaches while reducing the environmental impact of their own assets. Fortunately, many tech principles and best practices exist to help CIOs reduce their impacts and empower CSOs.

Sources:

[1] According to the latest study published in Science

[2] Statista

[3] According to a study by “Le GreenIT”

[4] Study conducted by the Enterprise Strategy Group (ESG) for MEGA International on “The Strategic Role of Data Governance and Its Evolution,” October 2022

[5] Forrester, 2016

[7] Statista, March 2023 and February 2023

[8] McKinsey, July 2020

Source: https://www.linode.com/content/cloud-block-storage-benchmarks/