druid/docs/content/Concepts-and-Terminology.md

---
layout: doc_page
---
Concepts and Terminology
========================

The following definitions are given with respect to the Druid data store. They are intended to help you better understand the Druid documentation, where these terms and concepts occur.

More definitions are available on the [design page](Design.html).

* **Aggregation**&nbsp;&nbsp;The summarizing of data meeting certain specifications. Druid aggregates [timeseries data](#timeseries), which in effect compacts the data. Time intervals (set in configuration) are used to create buckets, while [timestamps](#timestamp) determine which bucket each data point is aggregated into.

* **Aggregators**&nbsp;&nbsp;A mechanism for combining records during realtime incremental indexing, Hadoop batch indexing, and in queries.

* **Compute node**&nbsp;&nbsp;Obsolete name for a [Historical node](Historical.html).

* **DataSource**&nbsp;&nbsp;A table-like view of data; specified in [specFiles](#specfile) and in queries. A dataSource specifies the source of data being ingested and ultimately stored in [segments](#segment).

* **Dimensions**&nbsp;&nbsp;Aspects or categories of data, such as languages or locations. For example, with *language* and *country* as dimensions, values could be "English" or "Mandarin" for language, or "USA" or "China" for country. In Druid, dimensions can serve as filters for narrowing down hits (for example, language = "English" or country = "China").

* **Ephemeral Node**&nbsp;&nbsp;A ZooKeeper node (or "znode") that exists for as long as the session that created the znode is active. For more information, see the [ZooKeeper documentation](http://zookeeper.apache.org/doc/r3.2.1/zookeeperProgrammers.html#Ephemeral+Nodes). In a Druid cluster, ephemeral nodes are typically used for commands (such as assigning [segments](#segment) to certain nodes).

* **Granularity**&nbsp;&nbsp;The time interval corresponding to aggregation by time. Druid configuration settings specify the granularity of [timestamp](#timestamp) buckets in a [segment](#segment) (for example, by minute or by hour), as well as the granularity of the segment itself. The latter is essentially the overall range of absolute time covered by the segment. In queries, granularity settings control how results are summarized over time.

* **Ingestion**&nbsp;&nbsp;The pulling and initial storing and processing of data. Druid supports realtime and batch ingestion of data, and applies indexing in both cases.

* **Master node**&nbsp;&nbsp;Obsolete name for a [Coordinator node](Coordinator.html).

* **Metrics**&nbsp;&nbsp;Countable data that can be aggregated. Metrics, for example, can be the number of visitors to a website, number of tweets per day, or average revenue.

* **Rollup**&nbsp;&nbsp;The aggregation of data that occurs at one or more stages, based on settings in a [configuration file](#specfile). 

  <a name="segment"></a>
* **Segment**&nbsp;&nbsp;A collection of (internal) records that are stored and processed together. Druid chunks data into segments representing a time interval, and these are stored and manipulated in the cluster.

* **Shard**&nbsp;&nbsp;A sub-partition of the data, allowing multiple [segments](#segment) to represent the data in a certain time interval. Sharding occurs along time partitions to better handle amounts of data that exceed certain limits on segment size, although sharding along dimensions may also occur to optimize efficiency.

  <a name="specfile"></a>
* **specFile**&nbsp;&nbsp;The specification for services, in JSON format; see [Realtime](Realtime.html) and [Batch-ingestion](Batch-ingestion.html).

  <a name="timeseries"></a>
* **Timeseries Data**&nbsp;&nbsp;Data points which are ordered in time. The closing value of a financial index or the number of tweets per hour with a certain hashtag are examples of timeseries data.

  <a name="timestamp"></a>
* **Timestamp**&nbsp;&nbsp;An absolute position on a timeline, given in a standard alphanumeric format, such as a date and time in UTC. [Timeseries data](#timeseries) points can be ordered by timestamp, and in Druid, they are.
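
To make the relationship between timestamps, granularity, dimensions, metrics, and rollup more concrete, here is a minimal Python sketch. It is an illustration only and does not reflect how Druid is implemented: the function names (`truncate_to_hour`, `rollup`), the event fields, and the `edits` metric are hypothetical, and hourly granularity with a sum aggregator is assumed.

```python
from collections import defaultdict
from datetime import datetime, timezone


def truncate_to_hour(ts: str) -> str:
    """Map an ISO 8601 UTC timestamp to the start of its hourly bucket."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)
    bucket = dt.replace(minute=0, second=0, microsecond=0)
    return bucket.strftime("%Y-%m-%dT%H:%M:%SZ")


def rollup(events, dimensions, metrics):
    """Sum each metric over events sharing a time bucket and dimension values."""
    totals = defaultdict(lambda: defaultdict(int))
    for event in events:
        # The key is the time bucket followed by the dimension values.
        key = (truncate_to_hour(event["timestamp"]),) + tuple(
            event[d] for d in dimensions
        )
        for m in metrics:
            totals[key][m] += event[m]
    return {key: dict(sums) for key, sums in totals.items()}


# Hypothetical raw events: a timestamp, two dimensions, and one metric.
events = [
    {"timestamp": "2013-11-21T14:04:22Z", "language": "English", "country": "USA", "edits": 3},
    {"timestamp": "2013-11-21T14:40:00Z", "language": "English", "country": "USA", "edits": 2},
    {"timestamp": "2013-11-21T15:05:00Z", "language": "Mandarin", "country": "China", "edits": 7},
]

rolled = rollup(events, dimensions=["language", "country"], metrics=["edits"])

# Dimensions double as filters: keep only rows where language = "English".
english_only = {key: sums for key, sums in rolled.items() if key[1] == "English"}
```

In this sketch the two English events fall into the same hourly bucket and collapse into one row, which is the compaction effect described under Aggregation. A coarser granularity would merge more rows, while a finer one would preserve more detail.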