Discover your Metadata: Apache Atlas

Manjit Singh
8 min read · Aug 17, 2020

What’s the first thing you do when you receive a parcel? You check the sender’s name and address.

What’s the first thing you look at when you get an email? You check the sender’s name and ID.

What do you check when buying anything? You check the product’s price, brand, manufacturing date and expiry date.

What do you check before travelling? The distance, the traffic and the route.

All of the above examples show that we look at Metadata even before the actual Data. Metadata tells us the origin, owner, creation time and other details about the data.

In today’s data-driven world, getting accurate data is not enough; it’s equally important to be aware of the Metadata of that Data. Users should know whether the data is reliable, whether it’s protected, who’s using it, where it came from and how it has been changed. This information can be gathered by tracking the Metadata of the data. Hence, it’s critical to discover, track and manage Metadata, as it gives detailed insight into the authenticity and integrity of the Data.

Metadata gives us the perspective and context of data. If you ever ask the questions below when you receive data, then you need a Metadata Management tool.

  1. Where is the Data coming from?
  2. Is the Data reliable and protected?
  3. Who is using the Data?
  4. How is this Data generated?

What is Metadata?

Metadata is Data about other Data. Metadata describes the data and provides information about it. Knowing the Metadata is knowing the “who, what, where, why, when and how” of the Data.

What is Metadata Management?

A Metadata Management application can be used to discover, manage and store your business and technical Metadata. It can also build Data Lineage, which gives an in-depth view of the operations performed on your data.

Metadata can also be used to classify the Data according to security levels, which helps with Data Governance. We will discuss Metadata and Lineage in more detail in the upcoming sections. First, let’s look at a tool built for discovering and storing Metadata.

What is Apache Atlas?

“Apache Atlas provides open metadata management and governance capabilities for organizations to build a catalog of their data assets, classify and govern these assets and provide collaboration capabilities around these data assets for data scientists, analysts and the data governance team.”

To summarize the above definition more simply:

Apache Atlas is a Metadata Management and Data Governance tool that tracks and manages the metadata changes happening to your data sets. Atlas provides a solution for gathering, processing, storing and maintaining Metadata. It also provides Lineage capability.

Apache Atlas brings Metadata Management, Data Governance and Lineage capabilities under a single roof.

Which Data Sets are supported by Apache Atlas?

Apache Atlas provides native support for Hadoop-based systems such as Apache Hive, Apache Sqoop, Apache Ranger, Apache Falcon and Apache Storm. This essentially means that we can track the Metadata and lineage of these Hadoop-based components in Atlas without any additional integration layer.

Apache Atlas provides four main functionalities: Metadata Management, Data Governance, Data Classification and Lineage.

Let’s look at each of the features in more detail.

Features of Apache Atlas

1) Centralized Metadata Management

Atlas provides a solution for scanning and recording new metadata types and for storing Metadata. Atlas also allows us to exchange Metadata with other tools, and it captures the details of metadata objects along with their relationships.

2) Data Classification

Atlas provides the ability to classify incoming data into various categories. We can also create classification rules so that incoming data is segregated based on those rules.

We can classify the data as private or public, and we can classify the data based on a pattern or regex to identify whether it belongs to a specific group. For example, we can create classifications based on patterns such as phone number, ZIP code or bank account number. Incoming data can then be identified and filtered based on those classifications.
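To make the pattern-based idea concrete, here is a minimal sketch of rules that map sample values to classification names. The rule names, the regular expressions and both helper functions are hypothetical illustrations, and the matching runs on the client side rather than inside Atlas. The list of `{"typeName": ...}` objects is assumed to be the body shape Atlas expects when classifications are attached to an entity over REST.

```python
import re

# Hypothetical rules: classification name -> regex applied to a sample value.
RULES = {
    "PHONE_NUMBER": re.compile(r"^\+?\d{10,12}$"),
    "ZIP_CODE": re.compile(r"^\d{5}(-\d{4})?$"),
}


def matching_classifications(value):
    """Return the names of all rules whose pattern matches the value."""
    return [name for name, pattern in RULES.items() if pattern.match(value)]


def classification_body(names):
    """Build a list of classification objects, one per matched rule name."""
    return [{"typeName": name} for name in names]


names = matching_classifications("94105")
print(classification_body(names))
```

The classification names themselves (here PHONE_NUMBER and ZIP_CODE) would need to exist in Atlas as classification types before they can be attached to an entity.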

3) Data Lineage

Data Lineage shows the origin, movement, transformation and destination of data. In other words, it shows the life cycle of the data: where the Data originated, how it traveled, which transformations were applied to it and where it finally landed.

Data lineage is a pictorial representation of the flow of data from origin to destination. It enables companies to trace the sources of specific business data, track its movement and see the effect of process changes.
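As a rough illustration of what a lineage graph lets you answer, the sketch below walks a small set of "source feeds destination" edges to find everything upstream of one dataset. The dataset names and the `upstream` helper are made up for this example; Atlas stores and serves lineage itself, so this only models the idea in memory.

```python
from collections import defaultdict, deque


def upstream(edges, target):
    """Return every dataset that feeds `target`, directly or indirectly."""
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)

    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical flow: two raw tables end up in one reporting table.
edges = [
    ("raw.orders", "staging.orders"),
    ("staging.orders", "mart.daily_sales"),
    ("raw.customers", "mart.daily_sales"),
]
print(sorted(upstream(edges, "mart.daily_sales")))
```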

Apache Atlas Architecture

Atlas components are divided into 4 major categories: Core, Integration, Metadata Sources and Applications.

Atlas High Level Architecture — Overview (https://atlas.apache.org/2.0.0/Architecture.html)
  1. Atlas Core: This is the core component of Atlas. It talks to the backend layer and writes the Metadata and Lineage information to the Atlas database. Atlas uses HBase as the backend metadata store and Solr for indexing. The core broadly consists of the Type System, which allows users to define and manage types and entities; the Graph Engine, which handles relationships between Metadata objects; and Ingest/Export, which adds Metadata and raises an event when Metadata changes.
  2. Integration: This layer is how users communicate with Atlas. Users can manage metadata in Atlas using two methods. 1) API: All functionality of Atlas is exposed to end users via a REST API that allows types and entities to be created, updated and deleted. 2) Messaging: In addition to the API, users can choose to integrate with Atlas using a messaging interface that is based on Kafka.
  3. Metadata Sources: This layer consists of the Data Sources that are natively supported by Atlas. If a Data Source is directly supported, users don’t need any extra integration work: they can install Atlas and start tracking their Metadata through it. The Data Sources that are natively supported in Atlas are HBase, Hive, Sqoop, Storm and Kafka.
  4. Applications: This layer consists of the applications that consume the Metadata managed by Atlas, such as the Atlas Admin UI and tag-based policies enforced through Apache Ranger.

Atlas Basic Concepts

1. Types

A ‘Type’ defines how a particular kind of metadata object is stored in Atlas. A Type consists of Attributes, which are the properties of the Metadata object. In programming terms, a Type is equivalent to a Class.

Let’s understand a Type with an example. Suppose you want to track the Metadata of a Hive table in Atlas. Then hive_table is a Type in Atlas, and the properties of that Hive table, such as the db name, the created time and the owner, are its attributes.

Type and Attributes of Hive Table
  1. A Type is always associated with a unique name.
  2. A Type can have many attributes.
  3. Each attribute has a Data type that it can accept.
  4. You can add any attribute based on your business use case.
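The four points above can be sketched as a type definition payload. This is a minimal sketch, not the built-in hive_table definition: the type name `sample_table`, its attributes and the `entity_typedef` helper are hypothetical, and the field names follow my understanding of the JSON that the Atlas v2 typedefs endpoint accepts. `DataSet` is used as the super type on the assumption that the new type should behave like other data set types.

```python
import json


def entity_typedef(name, attributes, super_types=("DataSet",)):
    """Build a typedefs payload with one entity type.

    `attributes` maps attribute name -> Atlas data type name (e.g. "string").
    """
    attribute_defs = [
        {
            "name": attr_name,
            "typeName": attr_type,
            "isOptional": True,
            "cardinality": "SINGLE",
            "isUnique": False,
            "isIndexable": False,
        }
        for attr_name, attr_type in attributes.items()
    ]
    return {
        "entityDefs": [
            {
                "name": name,  # the unique name of the Type
                "superTypes": list(super_types),
                "attributeDefs": attribute_defs,
            }
        ]
    }


# Hypothetical custom type with three attributes.
payload = entity_typedef(
    "sample_table",
    {"dbName": "string", "createTime": "date", "rowCount": "long"},
)
print(json.dumps(payload, indent=2))
```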

2. Entities

An ‘entity’ is an instance of a ‘Type’ and thus represents one metadata object in the real world. As mentioned above, a Type in Atlas is equivalent to a Class, and similarly an Entity is equivalent to an Object of that class. A Type can have many Entities.

Let’s understand Types and Entities with an example. Suppose we want to monitor the Metadata of Hive in Atlas. The Hive Data Source consists of Databases, Tables and Columns. In this case, we will have the Types hive_db, hive_table and hive_column. Each database, table and column present in Hive is then represented by an individual Entity.

Let’s assume we have a Hive instance with 1 Database, 2 Tables and 5 Columns. This Hive Data Source then has 1 Entity of hive_db, 2 Entities of hive_table and 5 Entities of hive_column. Below is an example of a hive_table entity. Each Entity consists of a typeName and attributes.

Hive Table Entity
  1. Every Entity is associated with a type referenced as typeName.
  2. Every Entity is assigned a GUID, which is a unique identifier.
  3. At any point, an Entity can be accessed using its GUID.
  4. The attribute values should be in line with the type defined.
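A hive_table entity along these lines might be described with a payload like the one below. This is a simplified sketch: the helper name and the sample values are made up, and a real hive_table entity carries more attributes than shown here (for example, a reference to its database). No GUID appears in the payload because Atlas assigns it when the entity is stored.

```python
import json


def hive_table_entity(db, table, cluster, owner):
    """Build a simplified entity payload for one Hive table."""
    return {
        "entity": {
            "typeName": "hive_table",  # the Type this Entity is an instance of
            "attributes": {
                "name": table,
                "qualifiedName": f"{db}.{table}@{cluster}",
                "owner": owner,
            },
        }
    }


# Hypothetical table "orders" in database "sales" on a cluster named "prod".
entity = hive_table_entity("sales", "orders", "prod", "etl_user")
print(json.dumps(entity, indent=2))
```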

3. Attributes

An Attribute is a property of the Type or Entity that we are managing in Atlas. An attribute has a name and a value; the attributes sections appear in the Type and Entity figures above. For an entity, the data type of each attribute value should match the data type defined in its Type.

4. Relationships

Not only can we use Atlas to store Metadata in the form of Entities, but we can also define relationships among the Entities. A Relationship defines how one Entity is related to another. Atlas can map those relationships and display a pictorial representation of the related Entities, which helps us understand how one entity depends on another.

Suppose a Database has a Table, and the Table has columns and foreign key references. The Relationship feature will then show each entity and its relationships.

Atlas Relationships- A Table Entity related to other Entities
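A relationship between two existing entities can be pictured as a small record that names the relationship type and the two ends. The sketch below assumes the `typeName`/`end1`/`end2` shape used by the Atlas v2 relationship API and assumes `hive_table_columns` as the relationship type linking a table to a column; the GUID strings are placeholders, not real identifiers.

```python
import json


def relationship_payload(type_name, end1_guid, end1_type, end2_guid, end2_type):
    """Describe a relationship between two entities identified by GUID."""
    return {
        "typeName": type_name,
        "end1": {"guid": end1_guid, "typeName": end1_type},
        "end2": {"guid": end2_guid, "typeName": end2_type},
    }


# Placeholder GUIDs; in practice these come from Atlas after entity creation.
rel = relationship_payload(
    "hive_table_columns",
    "table-guid-placeholder", "hive_table",
    "column-guid-placeholder", "hive_column",
)
print(json.dumps(rel, indent=2))
```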

5. Qualified Name

A Qualified Name is used to uniquely identify an Entity in Atlas. Each Entity is associated with a qualified name that is unique across Atlas. The Qualified Name is generated according to a convention that we specify.

For each entity, the qualified name should be generated according to that pattern.

Qualified Name Pattern and Sample for Hive Data Set
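For Hive, the commonly used convention is `db@cluster` for a database, `db.table@cluster` for a table and `db.table.column@cluster` for a column. The helper below is a small sketch of that convention; the function name is hypothetical, and lowercasing the Hive names is an assumption made here so that the same object always yields the same qualified name.

```python
def hive_qualified_name(cluster, db, table=None, column=None):
    """Build a qualified name following the db.table.column@cluster pattern."""
    if column and not table:
        raise ValueError("a column needs a table")
    parts = [db]
    if table:
        parts.append(table)
    if column:
        parts.append(column)
    # Assumption: Hive names are normalized to lowercase; the cluster is kept as given.
    return ".".join(part.lower() for part in parts) + "@" + cluster


print(hive_qualified_name("prod", "sales"))
print(hive_qualified_name("prod", "sales", "orders"))
print(hive_qualified_name("prod", "sales", "orders", "order_id"))
```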

6. GUID

Just like the Qualified Name, a GUID is a unique ID which is assigned to each Entity by Atlas. The GUID is generated by Atlas once the Entity is inserted.
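Since both the GUID and the Qualified Name identify an entity, either can be used to look one up. The sketch below only builds the lookup URLs and sends nothing. The base URL assumes a local Atlas on its default port 21000, the helper names are hypothetical, and the two paths follow my reading of the Atlas v2 entity REST API.

```python
from urllib.parse import quote, urlencode

# Assumption: Atlas is running locally on its default port.
BASE_URL = "http://localhost:21000/api/atlas/v2"


def entity_by_guid_url(guid):
    """URL for fetching an entity by the GUID that Atlas assigned to it."""
    return f"{BASE_URL}/entity/guid/{quote(guid)}"


def entity_by_qualified_name_url(type_name, qualified_name):
    """URL for fetching an entity by its type and qualified name."""
    query = urlencode({"attr:qualifiedName": qualified_name}, safe=":")
    return f"{BASE_URL}/entity/uniqueAttribute/type/{quote(type_name)}?{query}"


print(entity_by_guid_url("guid-placeholder"))
print(entity_by_qualified_name_url("hive_table", "sales.orders@prod"))
```

The GUID is only known after Atlas has stored the entity, so a lookup by qualified name is the one a client can construct up front.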

How to Communicate with Apache Atlas?

Atlas exposes a variety of REST endpoints to work with types, entities, lineage and data discovery. Users can invoke these endpoints to perform the required operations. Here is the list of API methods and endpoints that Atlas exposes.

In addition to the API, users can choose to integrate with Atlas using a messaging interface that is based on Kafka.
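As a rough idea of what calling the REST API looks like, the sketch below prepares requests with Python's standard library without sending them. The base URL, the `admin`/`admin` credentials and the `atlas_request` helper are assumptions for a local test setup; the `/types/typedefs` and `/entity` paths are taken from the Atlas v2 REST API as I understand it.

```python
import base64
import json
from urllib.request import Request


def atlas_request(method, path, body=None,
                  base_url="http://localhost:21000/api/atlas/v2",
                  user="admin", password="admin"):
    """Prepare (but do not send) an HTTP request to an Atlas endpoint."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {"Authorization": f"Basic {token}", "Accept": "application/json"}
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(base_url + path, data=data, headers=headers, method=method)


# List all type definitions, and create an entity from a payload.
list_types = atlas_request("GET", "/types/typedefs")
create_entity = atlas_request("POST", "/entity", {"entity": {}})

print(list_types.get_method(), list_types.full_url)
print(create_entity.get_method(), create_entity.full_url)
# urllib.request.urlopen(list_types) would actually send the request.
```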

I will discuss the various operations in Apache Atlas in an upcoming article.

Summary

  1. Metadata Management is useful for getting insights into your data.
  2. Apache Atlas is an open source tool to track and manage Metadata. It can help with Data Governance, Classification, Lineage and entity relationships.
  3. To track Metadata in Atlas, you should be familiar with the Atlas terminology and with your Data Sets.
  4. Each Data Set in Atlas is organized into Types. Each Type has Attributes.
  5. An Attribute defines a property of a Type in a key-value format.
  6. An Entity is an instance of a Type. Each Database, Table, Column, View, Stored Procedure etc. present in your Data Set is represented by a separate Entity.
  7. Every Entity in Atlas is associated with a Type and has attributes. It also has a Qualified Name and a GUID.
  8. You can create relationships among entities, which are shown in the Atlas UI.
  9. Users can communicate with Atlas via the REST API and Kafka messaging.
  10. If your Data Set is natively supported in Atlas, then you can directly track its Metadata via the Atlas APIs. If you have a custom Data Set that is not directly supported in Atlas, then you need a framework that can register your custom Data Set with Atlas. We will learn about this in the next article.

Written by Manjit Singh

Software Engineer at Microsoft