Building a Big Data System for Disease Analysis in Hospitals
1️⃣ Introduction
In the digital era, hospitals generate vast and continuously growing amounts of patient data. To manage and analyze this data efficiently, a big data system capable of large-scale processing is required.
In this project, my team and I built a big data system for disease management and analysis in hospitals. This system enables patient data processing from collection to visualization using Pentaho, PostgreSQL, Apache Spark, Apache Airflow, and Metabase in an Ubuntu Linux environment.
2️⃣ System Architecture
This system is designed with three main stages:
- Data Ingestion: Collecting patient data in JSON format, then cleaning it using Pentaho Data Integration (PDI) before storing it in a PostgreSQL database.
- Extract, Transform, Load (ETL): Using Apache Spark for large-scale data processing and transformation into a Data Mart.
- Orchestration & Visualization:
  - Apache Airflow is used to automate the ETL workflow.
  - Metabase is used to present analysis results in an interactive dashboard.
3️⃣ Data Pipeline
The following is the data pipeline flow used in this system:
Source Data (Raw Data)
- Patient and registration data are stored in JSON format with fields such as the following (sample records are sketched after this list):
  - Patient Data: id_pasien, name, age, gender, address, occupation, phone number, email
  - Registration Data: id_pendaftaran, id_pasien, department, payment method, registration date, symptoms
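As a rough illustration, the raw records might look like the following. This is a minimal sketch in Python dict form; the field names follow the lists above, but the concrete values are invented for illustration.

```python
# Hypothetical sample of one raw patient record and one registration record.
# Field names follow the source description; values are invented examples.
patient = {
    "id_pasien": "P0001",
    "name": "Jane Doe",
    "age": 42,
    "gender": "F",
    "address": "Jl. Example No. 1",
    "occupation": "Teacher",
    "phone_number": "+62-812-0000-0000",
    "email": "jane.doe@example.com",
}

registration = {
    "id_pendaftaran": "R0001",
    "id_pasien": "P0001",          # foreign key to the patient record
    "department": "Internal Medicine",
    "payment_method": "Insurance",
    "registration_date": "2024-01-15",
    "symptoms": ["fever", "cough"],
}
```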
Data Ingestion into Database
- Pentaho Data Integration (PDI) is used to clean and insert data into the PostgreSQL database.
- The cleaned data is stored as Transaction Data Tables.
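In the project itself this cleaning and loading is built graphically in PDI, so there is no project code to show. Purely as a sketch of the equivalent logic, here is a minimal Python version using psycopg2; the connection settings, table name (pasien), and cleaning rules are all assumptions for illustration.

```python
import json
import psycopg2

# Minimal sketch of what the PDI transformation does: read raw JSON,
# drop obviously incomplete records, and insert the rest into a
# transaction table. Connection details and the "pasien" table name
# are assumptions, not the project's actual configuration.
conn = psycopg2.connect(host="localhost", dbname="hospital", user="etl", password="secret")

with open("patients.json") as f:
    records = json.load(f)

with conn, conn.cursor() as cur:
    for r in records:
        # Basic cleaning: skip records missing the primary key or a name.
        if not r.get("id_pasien") or not r.get("name"):
            continue
        cur.execute(
            "INSERT INTO pasien (id_pasien, name, age, gender, email) "
            "VALUES (%s, %s, %s, %s, %s) ON CONFLICT (id_pasien) DO NOTHING",
            (r["id_pasien"], r["name"], r.get("age"), r.get("gender"), r.get("email")),
        )

conn.close()
```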
Data Transformation
- Apache Spark processes data from PostgreSQL and links it with Master Data tables, such as gejala_to_diagnosis (symptom-to-diagnosis) and diagnosis_to_obat (diagnosis-to-medication).
- Data transformation is performed to create a Data Mart ready for analysis (see the sketch after this list).
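A minimal PySpark sketch of this step might look as follows. The JDBC settings, the transaction table name (pendaftaran), and the join keys are assumptions; only the two master table names come from the project description.

```python
from pyspark.sql import SparkSession

# Minimal sketch: read transaction and master tables from PostgreSQL over
# JDBC, join symptoms to diagnoses and diagnoses to medications, and write
# an analysis-ready data mart back to PostgreSQL. Connection details,
# table names other than the two master tables, and join keys are
# illustrative assumptions.
spark = (
    SparkSession.builder
    .appName("hospital-datamart")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

jdbc_url = "jdbc:postgresql://localhost:5432/hospital"
props = {"user": "etl", "password": "secret", "driver": "org.postgresql.Driver"}

def read_table(name):
    return spark.read.jdbc(jdbc_url, name, properties=props)

pendaftaran = read_table("pendaftaran")                   # registration transactions
gejala_to_diagnosis = read_table("gejala_to_diagnosis")   # symptom -> diagnosis master
diagnosis_to_obat = read_table("diagnosis_to_obat")       # diagnosis -> medication master

# Assumes pendaftaran stores one symptom per row in a "gejala" column.
datamart = (
    pendaftaran
    .join(gejala_to_diagnosis, on="gejala", how="left")
    .join(diagnosis_to_obat, on="diagnosis", how="left")
)

# Overwrite the data mart table consumed by the dashboards.
datamart.write.jdbc(jdbc_url, "datamart_diagnosis", mode="overwrite", properties=props)
```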
Orchestration & Automation
- Apache Airflow schedules and controls the entire ETL process automatically.
- Ubuntu Linux is used as the main operating system for this pipeline.
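As an illustration of the orchestration, a minimal Airflow DAG (Airflow 2.x style) might chain the ingestion and Spark steps like this. The operator choice, schedule, and script paths are assumptions, not the project's actual DAG.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal sketch of a daily ETL DAG: run the Pentaho ingestion job first,
# then the Spark transformation. Paths and the schedule are assumptions.
with DAG(
    dag_id="hospital_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_with_pentaho",
        # kitchen.sh is PDI's command-line runner for .kjb jobs.
        bash_command="/opt/pentaho/data-integration/kitchen.sh -file=/opt/etl/ingest_job.kjb",
    )

    transform = BashOperator(
        task_id="transform_with_spark",
        bash_command="spark-submit /opt/etl/build_datamart.py",
    )

    ingest >> transform
```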
Data Visualization
- Processed data is stored in PostgreSQL as a Data Mart.
- Metabase creates visual dashboards, helping hospitals make data-driven decisions.
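Metabase dashboards are built point-and-click, but each card ultimately runs a query against the data mart. As a sketch of the kind of aggregation a dashboard card might use, shown here via pandas; the table and column names are assumptions.

```python
import pandas as pd
from sqlalchemy import create_engine

# Sketch of a dashboard-style aggregation over the data mart:
# case counts per department and diagnosis. Table, column, and
# connection names are illustrative assumptions.
engine = create_engine("postgresql+psycopg2://etl:secret@localhost:5432/hospital")

query = """
    SELECT department, diagnosis, COUNT(*) AS total_cases
    FROM datamart_diagnosis
    GROUP BY department, diagnosis
    ORDER BY total_cases DESC
    LIMIT 10
"""
top_cases = pd.read_sql(query, engine)
print(top_cases)
```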
4️⃣ Technologies Used
The following are the key technologies used in this project:
🔹 Pentaho Data Integration (PDI) – For initial data extraction and cleaning.
🔹 PostgreSQL – As the main database for storing transaction data and transformation results.
🔹 Apache Spark – For large-scale data processing.
🔹 Apache Airflow – For automating ETL workflows.
🔹 Metabase – For interactive dashboard visualization.
🔹 Ubuntu Linux – As the main operating system for running all big data services.
5️⃣ Conclusion
In this project, we built a big data system that helps hospitals analyze patient data more efficiently. By leveraging Pentaho, PostgreSQL, Spark, Airflow, and Metabase, we automated the ETL process and presented insights through easy-to-understand visualizations.
This system can be further enhanced by adding Machine Learning capabilities to predict diseases based on patient symptoms and improving performance through distributed computing.
💡 Big Data technology is not only crucial for businesses but also for the healthcare sector to enhance service quality and data-driven decision-making. 🚀