Using Distributed Data over HBase in Big Data Analytics Platform for Clinical Services
Table 4
Operational experiences, persistent issues, and overall limitations of the tested Big Data technologies and components that impacted the Big Data Analytics (BDA) platform.
Technology component
Clinical impact on platform
Hadoop Distributed File System (HDFS)
(i) Could not be reconfigured beyond six nodes because maintaining the clinical data across more nodes was very difficult (ii) Had to add a further 2–4 TB of storage for clinical data (iii) The clinical data required large local disks
MapReduce
(i) Ingestion failed entirely (ii) Clinical index files had to be removed from the nodes (iii) Extremely slow performance when working with clinical data (iv) Clinical data required more advanced algorithms
HBase
(i) RegionServers had to be configured to form the clinical database (ii) Required ongoing monitoring and log checking (iii) Compaction had to be run regularly (iv) Ran only 50 million rows of clinical data
ZooKeeper & YARN
(i) Extremely slow performance for both components when ZooKeeper services were not running properly; additional configuration minimized this limitation, leaving only a few issues for YARN
Phoenix
(i) A database schema with the current file names had to be maintained in a file on the nodes, so that any ingested file that did not match raised an error, and ingested data had to be verified against the schema metadata when running queries (ii) This never occurred while ingesting files but occurred many times at first when running queries
Spark
(i) Slow performance
Zeppelin
(i) 30-minute delay before queries could be run, the same delay as with Jupyter (ii) No fix was found for this issue
Jupyter
(i) Once Java was established, it offered high usability and excellent performance
Drill
(i) Extremely fast but with poor usability (ii) Only some integration with other interface engines
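The Phoenix row above describes a name-matching safeguard: a schema file kept on each node lists the expected file names, mismatched ingests raise an error, and queried fields are verified against the schema metadata. A minimal sketch of that check, with all file names and column names being illustrative placeholders rather than values from the platform:

```python
# Illustrative sketch of the Phoenix-style schema check in Table 4.
# The file names and column names below are hypothetical examples,
# not taken from the clinical platform itself.

SCHEMA_FILE_NAMES = {"encounters_2015.csv", "labs_2015.csv"}   # names listed in the on-node schema file
SCHEMA_COLUMNS = {"PATIENT_ID", "ENCOUNTER_DATE", "LAB_CODE"}  # columns in the schema metadata


def check_ingest(file_name: str) -> bool:
    """Return True if the ingested file name matches the schema file; a mismatch is an error."""
    return file_name in SCHEMA_FILE_NAMES


def missing_columns(queried: list) -> list:
    """Return the queried columns that do not exist in the schema metadata."""
    return [c for c in queried if c not in SCHEMA_COLUMNS]


# A matching file ingests cleanly; a query on an unknown column is flagged,
# mirroring the errors the table reports when queries were first run.
print(check_ingest("labs_2015.csv"))                    # matching file name
print(check_ingest("labs_2016.csv"))                    # mismatched file name -> error case
print(missing_columns(["PATIENT_ID", "DIAGNOSIS"]))     # unknown column flagged
```

This mirrors the table's observation that errors surfaced at query time rather than ingest time: the file-name check passes whenever the schema file is current, but a query referencing a field absent from the schema metadata is caught only when the query runs.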