Implement a multicloud data lake integration architecture

This reference architecture describes how you can bring the data from different cloud providers and on-premises data sources to a data lake hosted in OCI. This architecture covers batch integration, data integration, real time integration and event based integration scenarios.

The following diagram illustrates the data flow for this reference architecture.Description of the illustration oci_multicloud_datalake_flow.png

oci-multicloud-datalake-flow-oracle.zip

The following diagram illustrates this reference architecture.

oci-multicloud-datalake-oracle.zip

Oracle Integration Cloud (OIC) is used for the following scenarios:

  • Receiving data from Oracle Cloud applications, CRM, E-commerce and on-premises/3rd party cloud applications in real-time and then loading into data lake.
  • Loading the data into data lake from a file (less volume) generated by a data-source.
  • Exposing Oracle Integration Cloud REST APIs to webhook platforms, receiving the data in real-time and loading into the data lake.
  • Some IOT platforms (Geotab, CheckSafe, and so on) have webhook fuctionality and send data to any https api for new events so they can connect directly to the API Gateway.
  • Receiving data from social media platforms (Facebook, LinkedIn, Twitter, Slack, and so on) and loading into the OCI data lake.

The architecture has the following components:

  • Region

    An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).

  • Availability domains

    Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain is unlikely to affect the other availability domains in the region.

  • Virtual cloud network (VCN) and subnets

    A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you complete control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don’t overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.

  • Integration

    Oracle Integration is a fully managed service that allows you to integrate your applications, automate processes, gain insight into your business processes, and create visual applications.

  • Oracle Data Integration

    Oracle Cloud Infrastructure Data Integration is a fully managed, serverless, cloud-native service that extracts, loads, transforms, cleanses, and reshapes data from a variety of data sources into target Oracle Cloud Infrastructure services, such as Autonomous Data Warehouse and Oracle Cloud Infrastructure Object Storage. ETL (extract transform load) leverages fully-managed scale-out processing on Spark, and ELT (extract load transform) leverages full SQL push-down capabilities of the Autonomous Data Warehouse in order to minimize data movement and to improve the time to value for newly ingested data. Users design data integration processes using an intuitive, codeless user interface that optimizes integration flows to generate the most efficient engine and orchestration, automatically allocating and scaling the execution environment. Oracle Cloud Infrastructure Data Integration provides interactive exploration and data preparation and helps data engineers protect against schema drift by defining rules to handle schema changes.

  • Oracle Business Intelligence Cloud Connector

    The Oracle BI Cloud Connector (BICC) is a useful tool for extracting data from Fusion and for storing it in shared resources like Oracle Universal Content Management (UCM) Server or cloud storage in CSV format.

  • OIC Connectivity Agent

    With the OIC connectivity agent, you can create hybrid integrations and exchange messages between applications in private or on-premises networks and Oracle Integration Cloud.

  • Data Lake

    A data lake is a scalable, centralized repository that can store raw data and enables an enterprise to store all its data in a cost effective, elastic environment. A data lake provides a flexible storage mechanism for storing raw data.

  • Object storage

    Object storage provides quick access to large amounts of structured and unstructured data of any content type, including database backups, analytic data, and rich content such as images and videos. You can safely and securely store and then retrieve data directly from the internet or from within the cloud platform. You can seamlessly scale storage without experiencing any degradation in performance or service reliability. Use standard storage for “hot” storage that you need to access quickly, immediately, and frequently. Use archive storage for “cold” storage that you retain for long periods of time and seldom or rarely access.

  • Autonomous Database

    Oracle Cloud Infrastructure Autonomous Database is a fully managed, preconfigured database environments that you can use for transaction processing and data warehousing workloads. You do not need to configure or manage any hardware, or install any software. Oracle Cloud Infrastructure handles creating the database, as well as backing up, patching, upgrading, and tuning the database.

  • Analytics

    Oracle Analytics Cloud is a scalable and secure public cloud service that empowers business analysts with modern, AI-powered, self-service analytics capabilities for data preparation, visualization, enterprise reporting, augmented analysis, and natural language processing and generation. With Oracle Analytics Cloud, you also get flexible service management capabilities, including fast setup, easy scaling and patching, and automated lifecycle management.

  • Data catalog

    Oracle Cloud Infrastructure Data Catalog is a fully managed, self-service data discovery and governance solution for your enterprise data. It provides data engineers, data scientists, data stewards, and chief data officers a single collaborative environment to manage the organization’s technical, business, and operational metadata.