Business data is valuable and warrants significant consideration when deciding where and how to store it. First I will describe the primary considerations for business data storage for analytics. Then I will use the considerations to compare standard data storage technologies to reach a conclusion about the best data storage technology available for business data analytics.
We begin by discussing considerations for business data storage for analytics.
Data access control
Business data should be access controlled. Only the specified parties should have access to business data.
Certain businesses may require fine-grained access control. For example, a customer should have access to their purchase history, but not the purchase history of other customers. This holds even if there is a single table for all purchases.
Therefore, business data must be stored such that access can be controlled at a fine level of granularity without having schema design implications. Typically there should not be one data table per customer, instead, there should be data access control such that each customer can access certain parts of the customer table.
Accessibility and compatibility
Business data should be easy to access. Business data should be compatible with standard data visualization and business intelligence tools.
SQL is the standard data access query language. It is used by almost all business intelligence tools. Most data analysts and many developers have experience accessing data with SQL. Other data access tools are used, but they are typically used to augment SQL access, not replace it.
Therefore, business data should be stored such that it can be queried easily using SQL. This makes the data accessible and compatible with most standard tools.
Scalability
Queries to answer basic business questions should be quick and inexpensive. Analytical queries should have no impact on services provided by production systems.
A key consideration for data ingestion and query scalability is data partitioning. A single storage system without indexing would require a complete sequential scan of all rows in order to find specific data. Effective partitioning allows queries to be optimized such that not all data needs to be scanned and the data that does need to be scanned can be read in parallel. Partitioning is not required when storing business data, but it makes scaling easier.
Therefore, business data must be stored such that data access is quick, inexpensive, and has negligible impact on production systems.
Compliance
Business data often contains sensitive user data. That data needs to be stored with a high level of security. Business data should be stored with strong security guarantees.
Business data in general should be stored for a limited period of time. This is legally required in some cases. Managing data retention is important and can become complex. As a result business data should be stored such that it can be automatically deleted by the storage system when it becomes too old.
Comparison of business data storage solutions
Now that the considerations for a business data storage system have been discussed, we can look at some specific solutions and see how they compare.
Transactional databases
Typically a business will have its data initially stored in a transactional database. When a customer makes an action that information will be immediately updated in the databases.
Transactional databases support SQL so they are compatible with visualization and business intelligence tools. They allow for fine-grained data access control through roles and row-level security. They allow for row-level TTLs (time to live) that can be used for compliance purposes.
However, they are designed to scale to a high level of writes and indexed reads. The business may need to answer a question like “How many purchases did customers over the age of 50 make in the last month?”. This question would be quite expensive to answer using a transactional database because the database is unlikely to have an index on customer age. Large queries on fields lacking indices will not scale well for transactional databases.
Columnar files
Business data can be stored on a distributed file system like Hadoop, Amazon S3 or GCP Blobstore. It can be stored using a columnar format like Apache Parquet. Apache Parquet allows for arbitrarily complex queries to be executed with good performance without indices. By storing business data as files, those files can be accessed by many users concurrently without putting additional load on the production system.
Querying columnar files can be efficient, but relies on a proper data partitioning strategy. A good data partitioning strategy is partitioning by day and by table. This means all data for a given day and table will be in its own folder on the file system. As long as the business data is somewhat evenly distributed across each day, any time-bounded query will be efficient because only the directories for the relevant days will be accessed. If you follow this partitioning scheme your files may look like:
/Customers/2022/04/03/data_1.parquet
/Purchases/2022/04/03/data_1.parquet
/Purchases/2022/04/03/data_2.parquet
OR
/Customers/2022-04-03/data_1.parquet
/Purchases/2022-04-03/data_1.parquet
/Purchases/2022-04-03/data_2.parquet
Most distributed file systems allow row-level access control on columnar files. Here is an article on how AWS S3 allows for this use case. Most distributed file systems allow files to be automatically deleted after a certain duration has passed.
Analytical databases
Analytical databases are an abstraction on top of columnar files. Under the hood, they use the same type of data storage as columnar files, but they ingest the data and provide access through a set of APIs.
When I look at the features for an analytical database service like AWS Redshift the main additional features I see that are not covered by columnar files are improved integration with other AWS services and automatic data partitioning suggestions.
Closing thoughts
Columnar files and analytical databases are good options for storing business data for analytics whereas transactional databases are not. This does not mean that transactional databases should not be used to store business data, it just means they have limited flexibility and may need to be used in parallel with other options.
Analytical databases provide more features than columnar files. However, analytical databases also incur a significant amount of vendor lock-in. If you become dependent on the features of a specific analytical database you may find it difficult to transition to a different analytical database later. By adding an abstraction on top of the data you lose visibility and control of the data organization. The abstraction of an analytical database introduces an additional learning curve as well as higher storage costs.
Columnar files are the best option for storing business data. They are simple and intuitive. They provide quick, scalable access to data and allow for arbitrary levels of query flexibility. Analytical databases are also a good option, but their additional complexity does not provide sufficient additional value to justify their downsides.
construction
butchery
storage