iFARM: A big data platform for data analytics

The healthcare industry has many different data stakeholders. They can be broadly classified into a few groups, as listed below.

The full list is long, so let us group the stakeholders into biotech, life sciences, pharmaceutical, insurance, and device manufacturing companies. A second group of data owners comprises research institutions, hospitals, clinics, and diagnostic labs. Lastly, regulatory authorities and government agencies also need healthcare data.

With data volumes growing by 80% over the last decade and reaching the zettabyte scale, estimated to grow to 44 zettabytes (44 trillion gigabytes) by 2020 (source: Forbes), the challenge of curating data is going to drive innovation.

Of all this data, only about 0.5% is ever analyzed; the rest ends up in archives or is lost through lack of technology or lack of foresight.

However, the industry has realized the importance of data and is either ramping up its technology infrastructure or seeking technology that enables data stewards to leverage the data cloud for useful purposes.

Many companies, in collaboration with leading technology vendors, are developing solutions and offering services, both cloud-hosted and on-premise, to manage the data and mine it for deep insights.

At the same time, the need to envision an architecture landscape that does not complicate technology and data management within the enterprise is driving companies to invest in forward-looking frameworks and products built within a managed architectural landscape.

iFARM is one such product. Built on the Hortonworks Data Platform, it uses an open-source technology framework to offer an on-premise and cloud-hosted platform-as-a-service solution.

The concept is simple: provide an underlying technology framework that powers business applications built on top of it, encapsulating the different health domains into applications delivered as a service.

System Architecture

Client workstations connect to the cluster by creating a session instance with valid credentials. Clients are authenticated and then granted access to the cluster's services.

iFARM PaaS Framework

The architecture has four main pillars spread across three areas: the data acquisition layer, the data processing layer, and the presentation layer.

The data acquisition layer begins with a set of connectors and client protocols that support various mechanisms for bringing data into the cluster. Before data is acquired, a connection and a session are established. Various modes of authentication are supported, including MIT Kerberos, LDAP over SSL, and SAML. Connections and sessions are encrypted to ensure data privacy and security.

The cluster can be mounted on client machines, or clients can access it by making API calls to a gateway that hosts NFS, Samba, the Hadoop client, a proxy, and Knox. The gateway is installed on an edge node of the cluster.
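As a hedged sketch of the API-call path described above, the snippet below builds the HTTPS URL for a WebHDFS operation routed through a Knox gateway. The hostname, port, and topology name ("default") are assumptions for illustration; the URL layout follows Knox's standard gateway/&lt;topology&gt;/webhdfs convention.

```python
# Hypothetical sketch: constructing a WebHDFS call proxied by Knox.
from urllib.parse import urlencode, urlunsplit

def knox_webhdfs_url(host, hdfs_path, op="LISTSTATUS",
                     port=8443, topology="default", **params):
    """Return the HTTPS URL for a WebHDFS operation behind a Knox gateway."""
    query = urlencode({"op": op, **params})
    path = f"/gateway/{topology}/webhdfs/v1{hdfs_path}"
    return urlunsplit(("https", f"{host}:{port}", path, query, ""))

url = knox_webhdfs_url("knox.example.com", "/data/raw")
print(url)
# https://knox.example.com:8443/gateway/default/webhdfs/v1/data/raw?op=LISTSTATUS
```

In practice the request would be sent with Kerberos (SPNEGO) or LDAP credentials, as described above, rather than anonymously.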

User access to the various services in the cluster is managed by Ranger, which is used to set up access policies for services such as HDFS, Knox, HBase, Hive, Solr, and Atlas. Atlas is used for metadata management and data governance.
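To make the policy idea concrete, here is a hedged sketch of an HDFS access policy expressed as the JSON body Ranger's public REST API accepts (POST to /service/public/v2/api/policy). The service, path, user, and group names are invented for illustration, and a real policy may carry additional fields.

```python
# Illustrative Ranger policy payload; names are assumptions, not iFARM's.
import json

policy = {
    "service": "ifarm_hadoop",            # assumed Ranger service name
    "name": "analysts-read-raw-zone",
    "resources": {
        "path": {"values": ["/data/raw"], "isRecursive": True}
    },
    "policyItems": [{
        "users": ["analyst1"],
        "groups": ["analysts"],
        "accesses": [{"type": "read", "isAllowed": True},
                     {"type": "execute", "isAllowed": True}],
    }],
}

body = json.dumps(policy, indent=2)
```

Such a payload would grant the analysts group read access under /data/raw while leaving writes denied by default.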

The different services in the cluster's client-server topology communicate over secure tunnels, and workload management is achieved through inter-process communication, remote procedure calls, API calls, and secure shells. Data is stored in the file system in Parquet, ORC, and SequenceFile formats, compressed using Gzip or Bzip2, with HDFS cache directives used to pin data in memory for faster I/O operations.

The data processing layer consumes the data and processes it downstream. Data is persisted in various data stores such as MongoDB, Neo4j, MySQL, HBase, Hive, and HDFS, and is processed by standard ELT tools such as Talend.

Processed data, stored in the various databases and file systems, is accessed by different applications. Data extraction, analysis, and visualization are the three important functions of the data presentation layer. Zeppelin is used for data analysis and visualization.

Security and privacy

Authentication and authorization

A client is authenticated over a secure tunnel. Access is centrally managed via LDAP or Kerberos, with LDAP configured over SSL. Access control can also be POSIX-based. SSH and SSL are used for authentication and connection encryption. Kerberos can use a variety of cipher algorithms to protect data.

A Kerberos encryption type (also known as an enctype) is a specific combination of a cipher algorithm with an integrity algorithm to provide both confidentiality and integrity to data.
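As an illustrative (not iFARM-specific) example of pinning enctypes, a client's krb5.conf can restrict which combinations are offered; the realm name and algorithm choices below are examples only.

```ini
# Example krb5.conf fragment restricting Kerberos enctypes (illustrative).
[libdefaults]
    default_realm = IFARM.EXAMPLE.COM
    permitted_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96
    default_tkt_enctypes = aes256-cts-hmac-sha1-96
    default_tgs_enctypes = aes256-cts-hmac-sha1-96
```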

Data security

Public Key Infrastructure (PKI) is the set of hardware, software, people, policies, and procedures that are needed to create, manage, distribute, use, store, and revoke digital certificates.

User authentication, client-server sessions, and data at rest or in motion are encrypted using either asymmetric 1024- or 2048-bit keys (such as RSA) where needed, at the cost of slower decryption, or 256-bit symmetric keys, which are large enough to resist brute-force attack.
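The back-of-the-envelope arithmetic behind the symmetric key-size claim can be sketched as follows; the attacker speed of 10^18 guesses per second is an assumed (generous) figure, and note that RSA key sizes are not directly comparable, since RSA is attacked by factoring rather than exhaustive search.

```python
# Why 256-bit symmetric keys resist brute force: the search space
# doubles with every key bit.
def keyspace(bits):
    return 2 ** bits

# A 256-bit space is the square of a 128-bit space, not merely double.
assert keyspace(256) == keyspace(128) * keyspace(128)

# Even at an assumed 1e18 guesses/second, a 128-bit space alone takes
# on the order of 10^13 years to enumerate.
seconds_per_year = 365 * 24 * 3600
years = keyspace(128) / (10**18 * seconds_per_year)
print(f"{years:.2e} years")
```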

Data access, transport, and CRUD operations are centrally managed and governed. Atlas is used to implement data governance and audit controls, while Ranger is used to define user policies for access to the file system and databases. User impersonation and ACL audit controls are also managed by Ranger. PHI and PII are protected, and the required safeguards are in place to govern data security and privacy.

Security in motion

Security in motion is applied using SSL. Web service calls and REST-based integrations are centrally managed by Knox, which acts as a gateway to several application APIs. Access to the gateway is managed and controlled using Kerberos and LDAP over a secure layer.

Security at rest

Data at rest is secured using the key management service (KMS). A file system or an entire mount can be encrypted; any data written or read is transparently encrypted and decrypted by the KMS. User access to keys and metadata is managed by defining policies for users and encryption zones in the Ranger KMS.

Encryption zones or file systems can be shared and mounted on client machines using NFS or Samba. Data can then be shared across clients over public and private networks while meeting all security and privacy rules.

Infrastructure and Network security

Infrastructure security is provided by the IaaS and PaaS layers; in addition, firewalls, virtual IPs, and proxy servers are set up. All required services are installed to ensure the availability of preventive, deterrent, detective, and corrective controls. The IaaS provider ensures identity management, infrastructure privacy, and physical security.

The iFARM infrastructure as a service is compliant with FedRAMP, FISMA, FFIEC, SOC, ISO 27001, ISO 27017, ISO 27018, HITRUST, and other industry standards.

iFARM Analytics workbench

iFARM offers an application called CALA (Computer Aided Life Analytics), which has three components.

1- Omilytics

Omics will drive the next generation of medicines and solve the mysteries of life; the challenge, however, is dealing with highly complex biological and non-biological data that vary widely in demographics, format, and volume. Collaboration between systems and technologies is the only way forward.

2- Diaglytics

Diagnostic data, over 80% of which is imaging and unstructured data, will constitute the next generation of data sources for designing efficient treatment regimens, drug discovery, and patient health management. This application addresses biomedical image processing challenges with a prediction and recommendation engine.

3- Oxalytics

Oxalytics is an application that works on sensor data emanating from home-care sensor devices and industrial sensors. The application lets users monitor their physical activities and vitals in real time. It helps hospitals and industries manage the huge volumes of data generated by sensors and run high-performance computing algorithms to analyze that data.
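As a hypothetical sketch of the kind of streaming computation such an application would run, the snippet below applies a fixed-size sliding-window mean to heart-rate samples and flags out-of-range windows. The field values, window size, and alert threshold are all invented for illustration; a production pipeline would run this logic in a stream-processing engine rather than a single Python process.

```python
# Sliding-window smoothing of streaming vitals with a simple alert rule.
from collections import deque

def sliding_mean(samples, window=5):
    """Yield the rolling mean of the last `window` samples."""
    buf = deque(maxlen=window)
    for s in samples:
        buf.append(s)
        yield sum(buf) / len(buf)

heart_rate = [72, 74, 71, 160, 158, 75, 73]   # a transient spike
alerts = [round(m, 1) for m in sliding_mean(heart_rate) if m > 100]
print(alerts)   # windows whose mean exceeds the (assumed) threshold
```

Smoothing over a window suppresses single-sample noise while still surfacing sustained anomalies, which is the usual trade-off in vitals monitoring.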


iFARM is built to address the enterprise data requirements of the healthcare industry. It provides an enterprise data model compatible with most suppliers' data models, such as Cerner, Humedica, and Centricity EMR, and compliant with SDTM, OMOP, CDISC, BRIDG, OMICS, ADaM, GA4GH, ICD, SNOMED, and MedDRA. It is compatible with HL7 data exchange standards. It provides out-of-the-box APIs for data curation, an analytical engine for deep image analytics, sensor and diagnostic data analytics, and genome analysis, and above all, it is HIPAA compliant. It is powered by HDP (the Hortonworks Data Platform).