Research Article

Using Distributed Data over HBase in Big Data Analytics Platform for Clinical Services

Table 4

Operational experiences, persistent issues, and overall limitations of tested Big Data technologies and components that impacted Big Data Analytics (BDA) platform.

Technology component | Clinical impact to platform

Hadoop Distributed File System (HDFS)
(i) Could not reconfigure the cluster beyond six nodes because maintaining the clinical data across more nodes proved very difficult
(ii) Had to add an additional 2–4 TB of storage for the clinical data
(iii) The clinical data required large local disks

MapReduce
(i) Ingestion failed completely
(ii) Clinical index files had to be removed from the nodes
(iii) Extremely slow performance when working with clinical data
(iv) Clinical data required more advanced algorithms

HBase
(i) RegionServers were needed to form the clinical database
(ii) Required ongoing monitoring and log checking
(iii) Required running compaction
(iv) Ran only 50 million rows of clinical data

ZooKeeper & YARN
(i) Performance was extremely slow for both components when ZooKeeper services were not running properly, but additional configuration minimized this limitation, with few remaining issues for YARN

Phoenix
(i) Had to maintain a database schema with current names in a file on the nodes, so that an error was shown if ingested files did not match, and had to verify that ingested data existed within the schema metadata when running queries
(ii) This never occurred while ingesting files but occurred many times at first when running queries

Spark
(i) Slow performance

Zeppelin
(i) A 30-minute delay before queries could be run, about the same delay as with Jupyter
(ii) No fix was found for this issue

Jupyter
(i) Once Java was established, it had high usability and excellent performance

Drill
(i) Extremely fast but with poor usability
(ii) Some integration with other interface engines
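The HBase maintenance items listed above (ongoing log checking and running compaction) can be sketched as routine operator commands; the table name "clinical_events" and the log path are illustrative assumptions, not details from the platform described here:

```shell
# Trigger a major compaction on a clinical table
# ("clinical_events" is a hypothetical table name).
echo "major_compact 'clinical_events'" | hbase shell -n

# Ongoing monitoring: check the RegionServer log for recent errors
# (log location shown is a common default; it varies by installation).
tail -n 200 /var/log/hbase/hbase-regionserver.log | grep -i error
```

In practice such commands would be scheduled (e.g., via cron) during low-traffic windows, since major compactions are I/O-intensive on large clinical datasets.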