
The Fight for Controlled Freedom of the Data Warehouse


Barr Moses   Oct.17.2022

The data gatekeeper is dead, long live the…oh no what have we done?

A silent alarm rings in my head whenever I hear someone utter the phrase, “data is everyone’s responsibility.”

You can just as easily translate this to mean that “data is no one’s responsibility,” too. This, readers, is what I call the “data tragedy of the commons.”

Here’s why.

The term “tragedy of the commons” comes from economics and refers to situations where a shared resource is open to everyone, with no strong regulations or guardrails to curtail abuse, and is gradually degraded as a result.

As companies ingest more data and expand access to it, your beautifully curated warehouse (or lakehouse, we don’t need to discriminate) can slowly turn into a data swamp as a result of this unfortunate reality of human behavior.

To use another adage… having too many cooks in the kitchen spoils the broth. Having too many admins in the Looker instance leads to deletions, duplications, and a whole host of other issues.

So, how can we fix this data tragedy of the commons?

In my opinion, the answer is to give data consumers, and even members of the data team when appropriate, guardrails or “controlled freedom.” And in order to do this, teams should consider four key strategies.

Remove the gatekeeper but keep the gate

Don’t get me wrong: everyone should care about data quality, security, governance, and usability. And I believe, on a fundamental level, they do. But I also believe most people are working toward a different set of incentives.

So the documentation they are supposed to add, the catalog they need to update, the unit test they should code, the naming convention they should use: all of it goes out the window. They aren’t being malicious; they just have a deadline.

Still, the tradeoff of faster data access is often a poorly maintained data ecosystem. We never, of course, want to disincentivize the use of data. Data democratization and literacy are important. But the consequences of sprawl are painful, even if not immediately felt.

All too often, data teams attempt to solve the “data democratization problem” by throwing technology at it. This rarely works. However, while technology can’t always prevent the diffusion of responsibility, it can help align incentives. And the end user or data consumer is most motivated to cooperate at the moment they need access to a resource.

We don’t want to impede access or slow organizational velocity with centralized IT ticket systems or human gatekeepers, but it’s reasonable to ask data consumers to eat some vegetables before moving on to dessert.

For example, data contracts add a little bit of friction early in the process by asking the data consumer to define what they need before any new data gets piped into the warehouse (sometimes in collaboration with the data producer).

Once this has been defined, the infrastructure and resources can be provisioned automatically, as Andrew Jones of GoCardless has laid out in one of his previous articles. Convoy uses a similar process, as its Head of Data Platform, Chad Sanderson, has detailed.
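To make the “keep the gate” idea concrete, here is a minimal sketch of a contract check that runs before anything is provisioned. It is not how GoCardless or Convoy implement their systems. The names `DataContract`, `Column`, `contract_problems`, and `provision` are hypothetical, and the required fields (owner, purpose, described columns) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    description: str = ""


@dataclass(frozen=True)
class DataContract:
    # Hypothetical contract shape: what the consumer must define up front.
    dataset: str
    owner: str  # the person or team accountable for this data
    purpose: str
    columns: tuple[Column, ...] = ()


def contract_problems(contract: DataContract) -> list[str]:
    """Return the reasons a contract is incomplete (empty list means OK)."""
    problems = []
    if not contract.owner.strip():
        problems.append("missing owner")
    if not contract.purpose.strip():
        problems.append("missing purpose")
    if not contract.columns:
        problems.append("no columns defined")
    for col in contract.columns:
        if not col.description.strip():
            problems.append(f"column {col.name!r} has no description")
    return problems


def provision(contract: DataContract) -> str:
    """Hypothetical provisioning step that refuses incomplete contracts."""
    problems = contract_problems(contract)
    if problems:
        raise ValueError("contract rejected: " + "; ".join(problems))
    # A real system would create tables, pipelines, and permissions here.
    return f"provisioned {contract.dataset} for {contract.owner}"
```

The friction sits exactly where the consumer is most motivated: no human approves the request, but nothing is created until the vegetables (owner, purpose, column descriptions) are on the plate.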

The next challenge for these data contract systems will be with change management and expiration. Once the data consumer already has the resources, what is the incentive to deprovision them when they are no longer needed?
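One possible answer, offered only as a sketch, is to make expiration the default so that keeping a resource takes a small action while giving it up takes none. The `Grant` record, the 90-day renewal window, and the function names below are all assumptions for illustration, not part of any system described here.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Grant:
    # Hypothetical record of a provisioned resource and when it lapses.
    dataset: str
    owner: str
    expires_on: date


def renew(grant: Grant, today: date, days: int = 90) -> Grant:
    """Return a copy of the grant extended `days` from today."""
    return Grant(grant.dataset, grant.owner, today + timedelta(days=days))


def due_for_deprovisioning(grants: list[Grant], today: date) -> list[Grant]:
    """Return grants whose expiry date has already passed."""
    return [g for g in grants if g.expires_on < today]
```

Under this design the incentive flips: an owner who still needs the data renews it, and anything nobody claims surfaces on the deprovisioning list on its own.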
