What is Druid?

Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte-sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations.
Druid is a tool in the Big Data Tools category of a tech stack.
Druid is an open source tool, and its source repository is hosted on GitHub.
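
As a rough illustration of the fast aggregate queries and approximate algorithms mentioned in the description, the sketch below builds a request body for Druid's SQL endpoint. The `page_events` datasource and its `country` and `user_id` columns are hypothetical names chosen for the example, and nothing is sent to a server here.

```python
import json

# Hypothetical names: the "page_events" datasource and its "country" and
# "user_id" columns are illustration-only assumptions, not from this page.
DRUID_SQL = """
SELECT
  TIME_FLOOR(__time, 'PT1H') AS hour_bucket,
  country,
  COUNT(*) AS event_count,
  APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM page_events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1, 2
ORDER BY event_count DESC
LIMIT 10
"""


def build_sql_request(query: str) -> str:
    """Serialize a SQL string into a JSON body for Druid's SQL API."""
    # Collapse whitespace so the query travels as a single line.
    single_line = " ".join(query.split())
    return json.dumps({"query": single_line, "resultFormat": "object"})


payload = build_sql_request(DRUID_SQL)
print(payload)
```

A body like this would typically be sent as an HTTP POST with a JSON content type to the `/druid/v2/sql` path of a Druid router or broker; the host and port depend on the deployment.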

Who uses Druid?

Companies
52 companies reportedly use Druid in their tech stacks, including Airbnb, Instacart, and Hepsiburada.

Developers
302 developers on StackShare have stated that they use Druid.

Druid Integrations

Zookeeper, Trino, strongDM, Metabase Cloud, and Querybook are some of the popular tools that integrate with Druid. In total, 7 tools integrate with Druid.

Pros of Druid

  • Real Time Aggregations (15)
  • Batch and Real-Time Ingestion (6)
  • OLAP (4)
  • OLAP + OLTP (3)
  • Combining stream and historical analytics (2)
  • OLTP (1)

Decisions about Druid

Here are some stack decisions, common use cases and reviews by companies and developers who chose Druid in their tech stack.

My process is like this: I get data once a month, either from Google BigQuery or as Parquet files from Azure Blob Storage. I have a script that does some cleaning and then stores the result as partitioned Parquet files, because the following process cannot handle loading all the data into memory.

The next process runs a heavy computation in parallel (per partition) and stores 3 intermediate versions as Parquet files: two are used for statistics, and the third is filtered to create the final files.

I make a report based on the two statistics files in a Jupyter notebook and convert it to HTML.

  • Everything is done with vanilla Python and Pandas.
  • Sometimes I may get the data in a different format.
  • The cloud service is Microsoft Azure.
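
The per-partition step described above might be sketched roughly as follows. This is a simplified stand-in, not the poster's actual script: the partitions are small in-memory lists instead of Parquet files, the filter rule is a placeholder, only one statistics output is produced instead of two, and a thread pool is used to keep the example self-contained (a process pool would usually suit CPU-heavy work better).

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical stand-in for partitioned files: partition key -> rows.
partitions = {
    "2024-01": [{"value": 1.0}, {"value": 3.0}, {"value": -2.0}],
    "2024-02": [{"value": 5.0}, {"value": 7.0}],
}


def process_partition(item):
    """Return (statistics, filtered rows) for a single partition."""
    key, rows = item
    values = [row["value"] for row in rows]
    stats = {"partition": key, "count": len(values), "mean": mean(values)}
    # Placeholder filter rule: keep only rows with a positive value.
    filtered = [row for row in rows if row["value"] > 0]
    return stats, filtered


# Each partition is processed independently, so only one partition's rows
# need to be held by a worker at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_partition, partitions.items()))

all_stats = [stats for stats, _ in results]
final_rows = [row for _, rows in results for row in rows]
print(all_stats)
print(len(final_rows), "rows kept")
```

In a setup like the one described, each worker would read and write one Parquet partition rather than an in-memory list.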

What I'm considering is the following:

Get the data with Kafka or with native Python, do the first processing, and store the data in Druid. The second processing would be done with Apache Spark, reading the data from Apache Druid.

The intermediate states can be stored in Druid too, and visualization would be done with Apache Superset.
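
For the Kafka-to-Druid step in this plan, ingestion is usually configured through a Kafka supervisor spec. The sketch below builds one as a Python dictionary. The datasource, topic, column names, and broker address are all hypothetical, and the field layout reflects my understanding of the supervisor spec format, so it should be checked against the documentation for the Druid version in use.

```python
import json

# All concrete values below (datasource, topic, columns, broker address)
# are hypothetical placeholders for illustration.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "cleaned_monthly_data",
            "timestampSpec": {"column": "event_time", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["source", "category"]},
            "granularitySpec": {
                "segmentGranularity": "day",
                "queryGranularity": "none",
                "rollup": False,
            },
        },
        "ioConfig": {
            "topic": "cleaned-data",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "useEarliestOffset": True,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Serialize to the JSON document that would be submitted to Druid.
spec_json = json.dumps(supervisor_spec, indent=2)
print(spec_json)
```

A spec like this is normally submitted with an HTTP POST to the `/druid/indexer/v1/supervisor` path; Spark and Superset would then read from the resulting datasource.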

Umair Iftikhar
Technical Architect at ERP Studio · 3 upvotes · 331.3K views

I am developing a solution that collects telemetry data from different devices, from a minimum of about 1,000 devices up to a maximum of 12,000. Each device sends 2 packets per second. This is time-series data; the data definitions and various reports, such as building information and maintenance records, are saved in PostgreSQL. I want to know the best solution. The data is needed for math and ML workloads that run different algorithms. The telemetry data itself is raw, without the definitions and information stored in PostgreSQL. Initially, I went with TimescaleDB because of its PostgreSQL support, but as the number of sites increased, I started facing many issues with TimescaleDB in terms of flexibility in storing data.

My other major requirement is replication of the database for reporting and other purposes. You may also suggest options other than Druid and Cassandra, but an open source solution is appreciated.
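
To put the stated numbers in perspective, the arithmetic below converts the device counts and the 2 packets per second per device from the post into an ingest rate. It assumes every device sends continuously, which the post does not state explicitly.

```python
PACKETS_PER_DEVICE_PER_SECOND = 2  # from the post
SECONDS_PER_DAY = 24 * 60 * 60


def ingest_rate(devices: int) -> tuple[int, int]:
    """Return (packets per second, packets per day) for a device count."""
    per_second = devices * PACKETS_PER_DEVICE_PER_SECOND
    return per_second, per_second * SECONDS_PER_DAY


# Minimum and maximum device counts mentioned in the post.
for devices in (1_000, 12_000):
    per_second, per_day = ingest_rate(devices)
    print(f"{devices:>6} devices: {per_second:>6} packets/s, {per_day:,} packets/day")
```

At the upper bound this works out to 24,000 packets per second, or roughly 2 billion packets per day, which is the volume any candidate store would need to sustain.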

Druid Alternatives & Comparisons

What are some alternatives to Druid?
HBase
Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Apache Hadoop.
MongoDB
MongoDB stores data in JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB was also designed for high availability and scalability, with built-in replication and auto-sharding.
Cassandra
Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.
Prometheus
Prometheus is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true.
Elasticsearch
Elasticsearch is a distributed, RESTful search and analytics engine capable of storing data and searching it in near real time. Elasticsearch, Kibana, Beats and Logstash are the Elastic Stack (sometimes called the ELK Stack).

Druid's Followers
815 developers follow Druid to keep up with related blogs and decisions.