Building a Big Data System for Disease Analysis in Hospitals
1️⃣ Introduction
In the digital era, hospitals generate vast and continuously growing amounts of patient data. To manage and analyze this data efficiently, a big data system capable of large-scale processing is required.
In this project, my team and I built a big data system for disease management and analysis in hospitals. This system enables patient data processing from collection to visualization using Pentaho, PostgreSQL, Apache Spark, Apache Airflow, and Metabase in an Ubuntu Linux environment.
2️⃣ System Architecture
This system is designed with three main stages:
- Data Ingestion: Collecting patient data in JSON format, then cleaning it using Pentaho Data Integration (PDI) before storing it in a PostgreSQL database.
- Extract, Transform, Load (ETL): Using Apache Spark for large-scale data processing and transformation into a Data Mart.
- Orchestration & Visualization:
  - Apache Airflow is used to automate the ETL workflow.
  - Metabase is used to present analysis results in an interactive dashboard.
3️⃣ Data Pipeline
The following is the data pipeline flow used in this system:
Source Data (Raw Data)
- Patient and registration data are stored in JSON format with fields such as the following (sample records are sketched after this list):
  - Patient Data: id_pasien, name, age, gender, address, occupation, phone number, email
  - Registration Data: id_pendaftaran, id_pasien, department, payment method, registration date, symptoms
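As a rough illustration, the raw records might look like the following. This is a minimal sketch in Python dict form; the field names follow the lists above, but the concrete values are invented for illustration.

```python
# Hypothetical sample of one raw patient record and one registration record.
# Field names follow the source description; values are invented examples.
patient = {
    "id_pasien": "P0001",
    "name": "Jane Doe",
    "age": 42,
    "gender": "F",
    "address": "Jl. Example No. 1",
    "occupation": "Teacher",
    "phone_number": "+62-812-0000-0000",
    "email": "jane.doe@example.com",
}

registration = {
    "id_pendaftaran": "R0001",
    "id_pasien": "P0001",          # foreign key to the patient record
    "department": "Internal Medicine",
    "payment_method": "Insurance",
    "registration_date": "2024-01-15",
    "symptoms": ["fever", "cough"],
}
```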
Data Ingestion into Database
- Pentaho Data Integration (PDI) is used to clean and insert data into the PostgreSQL database.
- The cleaned data is stored as Transaction Data Tables.
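In the project itself this cleaning and loading is built graphically in PDI, so there is no project code to show. Purely as a sketch of the equivalent logic, here is a minimal Python version using psycopg2; the connection settings, table name (pasien), and cleaning rules are all assumptions for illustration.

```python
import json
import psycopg2

# Minimal sketch of what the PDI transformation does: read raw JSON,
# drop obviously incomplete records, and insert the rest into a
# transaction table. Connection details and the "pasien" table name
# are assumptions, not the project's actual configuration.
conn = psycopg2.connect(host="localhost", dbname="hospital", user="etl", password="secret")

with open("patients.json") as f:
    records = json.load(f)

with conn, conn.cursor() as cur:
    for r in records:
        # Basic cleaning: skip records missing the primary key or a name.
        if not r.get("id_pasien") or not r.get("name"):
            continue
        cur.execute(
            "INSERT INTO pasien (id_pasien, name, age, gender, email) "
            "VALUES (%s, %s, %s, %s, %s) ON CONFLICT (id_pasien) DO NOTHING",
            (r["id_pasien"], r["name"], r.get("age"), r.get("gender"), r.get("email")),
        )

conn.close()
```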
Data Transformation
- Apache Spark processes data from PostgreSQL and links it with Master Data tables, such as gejala_to_diagnosis (symptom-to-diagnosis) and diagnosis_to_obat (diagnosis-to-medication).
- Data transformation is performed to create a Data Mart ready for analysis (see the sketch after this list).
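A minimal PySpark sketch of this step might look as follows. The JDBC settings, the transaction table name (pendaftaran), and the join keys are assumptions; only the two master table names come from the project description.

```python
from pyspark.sql import SparkSession

# Minimal sketch: read transaction and master tables from PostgreSQL over
# JDBC, join symptoms to diagnoses and diagnoses to medications, and write
# an analysis-ready data mart back to PostgreSQL. Connection details,
# table names other than the two master tables, and join keys are
# illustrative assumptions.
spark = (
    SparkSession.builder
    .appName("hospital-datamart")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

jdbc_url = "jdbc:postgresql://localhost:5432/hospital"
props = {"user": "etl", "password": "secret", "driver": "org.postgresql.Driver"}

def read_table(name):
    return spark.read.jdbc(jdbc_url, name, properties=props)

pendaftaran = read_table("pendaftaran")                   # registration transactions
gejala_to_diagnosis = read_table("gejala_to_diagnosis")   # symptom -> diagnosis master
diagnosis_to_obat = read_table("diagnosis_to_obat")       # diagnosis -> medication master

# Assumes pendaftaran stores one symptom per row in a "gejala" column.
datamart = (
    pendaftaran
    .join(gejala_to_diagnosis, on="gejala", how="left")
    .join(diagnosis_to_obat, on="diagnosis", how="left")
)

# Overwrite the data mart table consumed by the dashboards.
datamart.write.jdbc(jdbc_url, "datamart_diagnosis", mode="overwrite", properties=props)
```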
Orchestration & Automation
- Apache Airflow schedules and controls the entire ETL process automatically.
- Ubuntu Linux is used as the main operating system for this pipeline.
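As an illustration of the orchestration, a minimal Airflow DAG (Airflow 2.x style) might chain the ingestion and Spark steps like this. The operator choice, schedule, and script paths are assumptions, not the project's actual DAG.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal sketch of a daily ETL DAG: run the Pentaho ingestion job first,
# then the Spark transformation. Paths and the schedule are assumptions.
with DAG(
    dag_id="hospital_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_with_pentaho",
        # kitchen.sh is PDI's command-line runner for .kjb jobs.
        bash_command="/opt/pentaho/data-integration/kitchen.sh -file=/opt/etl/ingest_job.kjb",
    )

    transform = BashOperator(
        task_id="transform_with_spark",
        bash_command="spark-submit /opt/etl/build_datamart.py",
    )

    ingest >> transform
```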
Data Visualization
- Processed data is stored in PostgreSQL as a Data Mart.
- Metabase creates visual dashboards, helping hospitals make data-driven decisions.
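Metabase dashboards are built point-and-click, but each card ultimately runs a query against the data mart. As a sketch of the kind of aggregation a dashboard card might use, shown here via pandas; the table and column names are assumptions.

```python
import pandas as pd
from sqlalchemy import create_engine

# Sketch of a dashboard-style aggregation over the data mart:
# case counts per department and diagnosis. Table, column, and
# connection names are illustrative assumptions.
engine = create_engine("postgresql+psycopg2://etl:secret@localhost:5432/hospital")

query = """
    SELECT department, diagnosis, COUNT(*) AS total_cases
    FROM datamart_diagnosis
    GROUP BY department, diagnosis
    ORDER BY total_cases DESC
    LIMIT 10
"""
top_cases = pd.read_sql(query, engine)
print(top_cases)
```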
4️⃣ Technologies Used
The following are the key technologies used in this project:
🔹 Pentaho Data Integration (PDI) – For initial data extraction and cleaning.
🔹 PostgreSQL – As the main database for storing transaction data and transformation results.
🔹 Apache Spark – For large-scale data processing.
🔹 Apache Airflow – For automating ETL workflows.
🔹 Metabase – For interactive dashboard visualization.
🔹 Ubuntu Linux – As the main operating system for running all big data services.
5️⃣ Conclusion
In this project, we built a big data system that helps hospitals analyze patient data more efficiently. By leveraging Pentaho, PostgreSQL, Spark, Airflow, and Metabase, we automated the ETL process and presented insights through easy-to-understand visualizations.
This system can be further enhanced by adding Machine Learning capabilities to predict diseases based on patient symptoms and improving performance through distributed computing.
💡 Big Data technology is not only crucial for businesses but also for the healthcare sector to enhance service quality and data-driven decision-making. 🚀