This content was translated from Korean to English using AI.

Conclusion

  • Register and manage everything through Lake Formation Data Lake Locations.

Background

One of the biggest challenges while operating an AWS data lake environment was
how to share data.

As the organization grew, accounts were split (multi-account) and the teams producing data diverged from the teams consuming it. Deciding
“how much access to grant” and tracking
“who accesses what and how” became increasingly difficult.

I have experienced various data sharing approaches over time,
and after considerable trial and error, the pros and cons of each method became clear.

Before systematizing a data sharing approach that fits our company’s situation,
I wanted to first organize and review the concepts and selection criteria.

Goal

  • Understand the permission check flow during data queries and identify where permission issues occur.
  • Understand the various data sharing methods and the flow of each.
  • Be able to evaluate and select the appropriate data sharing method.

Explanation

Types and Criteria for Data Sharing Methods

The “data” referred to here means tables registered in the Glue Catalog (schema or table data).

The choice depends on “what to share” and “how much to control.”

  • For the best results, manage primarily through Lake Formation (LF) and handle exceptions separately.

In an AWS data lake environment, data sharing can be broadly divided into three approaches:

1. Quick data file sharing only: S3/IAM

2. Sharing schema and tables: Glue Data Catalog

3. Sharing with centralized permission control: Lake Formation


Permission Check Flow During Data Queries

  • The high-level flow is as follows:

User Data Queries

- Accessing tables via Athena queries
- Glue ETL reading Catalog tables as input
- EMR/Redshift Spectrum referencing Catalog tables
- Accessing Catalog tables via boto3 through Athena, Glue, EMR, Redshift

What it means to manage with Lake Formation...

  • Registering S3 data locations (bucket/prefix) as Data Lake Locations in Lake Formation
  • The Lake Formation permission model applies to Glue Data Catalog objects (databases/tables) that point to the registered S3 locations
  • Lake Formation assumes the IAM role specified during registration and issues temporary credentials (credential vending) to integrated services (Athena/EMR/Glue, etc.)
  • Subdirectories under the registered path are included in the management scope
  • With Hybrid access mode, you can transition gradually by applying Lake Formation permissions to only some databases/tables in the Data Catalog
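
The registration step described above can be sketched roughly as follows. This is a minimal illustration, not a definitive setup: it only builds the arguments for boto3's Lake Formation `register_resource` call. The helper name `build_register_args`, the bucket, the prefix, the account ID, and the role name are all hypothetical placeholders.

```python
def build_register_args(bucket, prefix, role_arn, hybrid=False):
    """Build keyword arguments for lakeformation.register_resource().

    Hypothetical helper: it only assembles the request, it does not call AWS.
    """
    prefix = prefix.strip("/")
    # Registering a prefix also brings its subdirectories into scope.
    if prefix:
        resource_arn = f"arn:aws:s3:::{bucket}/{prefix}"
    else:
        resource_arn = f"arn:aws:s3:::{bucket}"
    return {
        "ResourceArn": resource_arn,
        "UseServiceLinkedRole": False,   # use the custom role given below instead
        "RoleArn": role_arn,             # role Lake Formation assumes to vend credentials
        "HybridAccessEnabled": hybrid,   # keep IAM-based access alongside LF permissions
    }


# Placeholder values, assumed for illustration only.
args = build_register_args(
    "example-datalake-bucket",
    "sales/",
    "arn:aws:iam::111111111111:role/ExampleLfRegisterRole",
    hybrid=True,
)
print(args["ResourceArn"])

# With credentials for the producer account, the call would look like:
# import boto3
# boto3.client("lakeformation").register_resource(**args)
```

Keeping the request as a plain dictionary makes it easy to review the registered path and role before anything is changed in the account.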

Understanding Through Examples

Example Scenario

  • Producer / Data Lake Account A
    • Stores data in S3
    • Manages metadata in Glue Data Catalog
  • Consumer / Analytics Team Account B
    • Uses shared data via Athena/Glue/ETL, etc.

Quick Data Sharing Only: S3 / IAM

The most basic approach, sharing only the S3 objects themselves.

Characteristics

  • Grants S3 access via bucket policies + consumer account IAM permissions
  • The consumer account must create tables manually for the metadata of the shared data
  • Glue Catalog metadata is not shared

Things to Be Aware Of

  • Best suited for quickly granting data access, but since only the data is shared, keeping schemas in sync requires a Glue Crawler or other supplementary measures.

Detailed Implementation Flow

1. Producer (Account A)
- Identify the S3 bucket/prefix to share
- Add Consumer (Account B) access permissions to the bucket policy
- (If encryption is used) Add Consumer permissions to the KMS Key policy
 
2. Consumer (Account B)
- Grant S3 Read permissions to the IAM Role/User
- Create tables manually in Glue Data Catalog or generate schema via Crawler
- Start querying via Athena/Glue/EMR using the created tables
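
As a rough sketch of the Producer-side step, the bucket policy below grants a consumer account read access to one prefix. The helper `build_bucket_policy`, the bucket name, the prefix, and the account ID are hypothetical. KMS key policy changes and the Consumer-side IAM policy are not covered here.

```python
import json


def build_bucket_policy(bucket, prefix, consumer_account_id):
    """Return an S3 bucket policy (as a dict) that lets another account read one prefix."""
    prefix = prefix.strip("/")
    principal = {"AWS": f"arn:aws:iam::{consumer_account_id}:root"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, restricted here to the shared prefix.
                "Sid": "ConsumerListPrefix",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reading objects is an object-level action on the prefix.
                "Sid": "ConsumerReadObjects",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Placeholder values, assumed for illustration only.
policy = build_bucket_policy("example-datalake-bucket", "sales/", "222222222222")
print(json.dumps(policy, indent=2))

# With credentials for the producer account, the policy could be applied with:
# import boto3
# boto3.client("s3").put_bucket_policy(
#     Bucket="example-datalake-bucket", Policy=json.dumps(policy)
# )
```

Granting to the account root delegates the decision to Account B, which still has to attach matching S3 read permissions to its own IAM Role/User.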

Sharing Schema and Tables: Glue Data Catalog

A method that shares not only data but also table definitions (schema).

Characteristics

  • Set resource policies on the data-owning account’s Glue Data Catalog
  • The consumer account registers it as an external DataCatalog in Athena
  • Tables can be queried in the format ownerCatalog.db.table

Things to Be Aware Of

  • Glue Catalog permissions cover only the metadata; after that check passes, S3 access is verified separately via S3/IAM policies.

Detailed Implementation Flow

1. Producer (Account A)
- Determine the target DB/tables to share
- Add Consumer (Account B) permissions to the Data Catalog Resource Policy
- Identify the S3 bucket/prefix to share
- Add Consumer (Account B) access permissions to the bucket policy
 
2. Consumer (Account B)
- Grant the Consumer IAM Role/User Glue permissions on Account A's catalog, plus S3 read permissions on the shared path
- Register the external Data Catalog in Athena
- Query in the format producerCatalog.db.table
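
The two catalog-related steps above might look roughly like this. It is a sketch under assumptions: the helper names, region, account IDs, database name, and catalog name are hypothetical, and the list of Glue actions is one plausible read-only set rather than a required minimum.

```python
import json


def build_glue_resource_policy(region, producer_account_id, consumer_account_id, database):
    """Return a Glue Data Catalog resource policy granting read-only metadata access."""
    base = f"arn:aws:glue:{region}:{producer_account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{consumer_account_id}:root"},
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:GetPartitions",
                ],
                "Resource": [
                    f"{base}:catalog",
                    f"{base}:database/{database}",
                    f"{base}:table/{database}/*",
                ],
            }
        ],
    }


def build_athena_catalog_args(name, producer_account_id):
    """Build keyword arguments for athena.create_data_catalog() in the consumer account."""
    return {
        "Name": name,
        "Type": "GLUE",
        "Parameters": {"catalog-id": producer_account_id},
    }


# Placeholder values, assumed for illustration only.
glue_policy = build_glue_resource_policy(
    "ap-northeast-2", "111111111111", "222222222222", "sales_db"
)
catalog_args = build_athena_catalog_args("producer_catalog", "111111111111")
print(json.dumps(glue_policy, indent=2))

# Producer account:
# import boto3
# boto3.client("glue").put_resource_policy(PolicyInJson=json.dumps(glue_policy))
# Consumer account:
# boto3.client("athena").create_data_catalog(**catalog_args)
```

With these placeholder names, a table would then be addressed in Athena as producer_catalog.sales_db.<table>. S3 access still has to be granted separately, as noted above.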

Sharing with Centralized Permission Control: Lake Formation

An approach that leverages Data Lake Locations in Lake Formation, a dedicated data lake governance service.

Characteristics

  • Permission management at the DB/table level
  • Row-level and column-level access control available
  • Requires AWS RAM invitation acceptance + Resource Link creation

Advantages

  • Enables policy-centric data access management
  • Granular control at the account/role/user level

Detailed Implementation Flow

1. Producer (Account A)
- Review existing Glue tables
  Check DB/tables for the S3 paths to register
- Register Lake Formation Data Lake Location
  Select S3 path
  Specify IAM Role for LF to assume
  Check Hybrid access mode if needed
  (Hybrid access mode: existing IAM-based access keeps working, and Lake Formation permissions apply only to the targets that are opted in)
- Grant DB/table permissions in Lake Formation
  Grant to the Consumer account, organization, or OU
- Create AWS RAM sharing invitation
  Send resource sharing invitation to the Consumer account
- (If using Hybrid) Configure opt-in for LF-governed targets
  "Make LF Permissions effective immediately" option available
  
2. Consumer (Account B)
- Accept the invitation in the AWS RAM console
- Create a Resource Link in Lake Formation
- Grant Resource Link permissions
  Describe (Resource Link)
  Grant on target (source resource)
- Delegate permissions to internal Consumer IAM Roles/Users
  (If using Hybrid) Configure opt-in settings
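
The grant and Resource Link steps in this flow can be sketched as follows. The helper names, account IDs, and database/table names are hypothetical. Passing the grant option so that Account B can delegate onward is an assumption about how the sharing is set up, not something the flow above prescribes.

```python
def build_grant_args(producer_account_id, consumer_account_id, database, table,
                     with_grant_option=True):
    """Build keyword arguments for lakeformation.grant_permissions() (producer side)."""
    permissions = ["SELECT", "DESCRIBE"]
    return {
        # For a cross-account grant, the principal identifier is the account ID.
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {
            "Table": {
                "CatalogId": producer_account_id,
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": permissions,
        # Assumption: the grant option is passed so the consumer can delegate further.
        "PermissionsWithGrantOption": list(permissions) if with_grant_option else [],
    }


def build_resource_link_args(local_database, link_name, producer_account_id,
                             database, table):
    """Build keyword arguments for glue.create_table() that create a Resource Link."""
    return {
        "DatabaseName": local_database,
        "TableInput": {
            "Name": link_name,
            # TargetTable makes this entry a link to the shared table in Account A.
            "TargetTable": {
                "CatalogId": producer_account_id,
                "DatabaseName": database,
                "Name": table,
            },
        },
    }


# Placeholder values, assumed for illustration only.
grant_args = build_grant_args("111111111111", "222222222222", "sales_db", "orders")
link_args = build_resource_link_args(
    "shared_links", "orders_link", "111111111111", "sales_db", "orders"
)
print(grant_args["Permissions"], link_args["TableInput"]["Name"])

# Producer account:
# import boto3
# boto3.client("lakeformation").grant_permissions(**grant_args)
# Consumer account, after accepting the AWS RAM invitation:
# boto3.client("glue").create_table(**link_args)
```

The Resource Link only points at the shared table. Consumer principals still need Describe on the link and a grant on the target, as listed in the steps above.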