What is a data swamp?

In a previous blog, we talked about how one of the benefits of a data lake is being able to explore a large variety of data. We can land whatever data we want in this structure, and there aren’t any rules for how to store it.

Just like you may have your own way of organizing folders on your computer, the data lake offers different options: you can create a series of containers, folders, sub-folders, and so on, because you can put any kind of data into the data lake. Getting data back out, however, requires some guidance based on how you structured and stored it. Without a good plan, you get a data swamp: stuff scattered all over the place (perhaps in different versions, or in different partitions), where it can be difficult to find anything.
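One common way to impose that kind of guidance is a consistent zone-and-partition naming convention. As a minimal sketch (the zone names, folder levels, and date partitioning here are illustrative assumptions, not a fixed standard), paths might be built like this:

```python
from datetime import date

def lake_path(zone: str, source: str, dataset: str, day: date) -> str:
    """Build a hypothetical data lake folder path: a zone (e.g. raw or
    curated), a source system, a dataset name, and year/month/day
    partition folders so dated files stay easy to locate and prune."""
    return (f"{zone}/{source}/{dataset}/"
            f"{day.year:04d}/{day.month:02d}/{day.day:02d}")

# e.g. raw IoT sensor drops land under a predictable, dated folder
print(lake_path("raw", "iot", "sensor-readings", date(2020, 3, 15)))
# raw/iot/sensor-readings/2020/03/15
```

The point of a convention like this is less the specific folder names and more that everyone landing data follows the same pattern, so exploration later doesn't depend on tribal knowledge.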

A lot of thought can go into how you logically organize your data. If you don’t put some upfront thought into that, you end up with files everywhere and no organization system. And that starts to diminish the value of your data lake.

How do you avoid the data swamp?

Other than just having a plan, what are some steps to take to prevent ending up with a data swamp? One of the first ways to avoid a data swamp is starting small. Don’t try to take on the world in your initial planning process. You should absolutely have an initial plan about what your structure should be, but it’s also important to plan time around refactoring. As soon as you introduce additional elements and requirements, your structure is going to change.

Don’t expect to get it perfect right away. Plan to be agile, knowing you will want to refactor it along the way. Even if you initially get it 100% right, the structure of your data lake is an ongoing thing that will continue to evolve. Just planning for that refactoring time will help you keep the data lake useful to your various users.

How do I plan my data lake structure?

When getting started with the structure for your data lake, ideally it would be requirements-driven. If you have a new Big Data disruptor to your current data warehouse (for example, new IoT data, JSON files, or log files), I would start taking those on individually. At the same time, take a step back and think about what other types of data you might store and how you would structure them.

The most difficult part of minimizing data swamp risk is finding the right balance between planning ahead for unknowns and realizing that (no matter how long you plan) there will be disruptors ahead. It’s not a one-and-done thing. Just make sure you allocate time to rethink your structure when it becomes necessary.

What’s the best data storage solution?

Data lakes work well for that initial landing and for quick exploration of data, but our most successful clients often implement a mix of data lakes and data warehouses. From an analytical standpoint, we're seeing the value of getting to a structured data model in a data warehouse as the next layer downstream from the data lake. This especially applies to Microsoft-centric technologies: Microsoft has said it considers it a best practice to have a physical relational data model that serves as the source for an Analysis Services tabular model or a Power BI data model.

That notion of going from data lake to data warehouse to tabular model to Power BI really is the best practice that we’re seeing in the industry. We’ve worked hard at Skyline Technologies to wrap some best practices around the implementation of that kind of a solution.

If balancing the needs of a data lake or data warehouse is one of the challenges you’re facing in your organization, our data team would welcome the opportunity to share more best practices with you and talk about this topic in greater detail. Feel free to contact us or subscribe to our blog for more practical insights.
