A modular and interpretable framework for tabular data analysis using LLaMA 7B: Enhancing preprocessing, modeling, and explainability with local language models

doi:10.1371/journal.pone.0341002

Table 1.

Summary of Related Work across the Tabular ML Pipeline.

Table 2.

Descriptive statistics for selected variables in the raw dataset.

Table 3.

Summary statistics of raw features before transformation.

Fig 1.

Three separate visualizations replacing the original pairplot.

These plots illustrate the relationships and distributional patterns among age, waiting_days, and the target label show_up. (a) Scatter: age vs. waiting_days by show_up. (b) Distribution of waiting_days across classes. (c) Distribution of age across classes.

Table 4.

Deterministic controls ensuring reproducible LLaMA-7B prompt behavior.

Table 5.

Summary of preprocessing measures applied to dataset irregularities.

Table 6.

Deterministic controls used to ensure reproducible LLaMA-7B behavior.

Fig 2.

End-to-end pipeline architecture showing the major modules from input data ingestion to profiling and final reporting.

Fig 3.

Detailed data flow diagram outlining each transformation step, branching logic, and artifact generation during the pipeline lifecycle.

Table 7.

Comparison of popular LLMs for local tabular preprocessing tasks.

Table 8.

Final dataset profile and runtime environment.

Table 9.

Comparative classification performance: LLaMA 7B vs. Mistral 7B on medical no-show prediction.

Table 10.

Final classification performance: Logistic Regression vs. XGBoost on medical no-show prediction.

Fig 4.

Spearman correlation heatmap across numeric and encoded categorical features.

A strong negative correlation is observed between WaitingDays and appointment-related features.

Fig 5.

Boxen plot illustrating the distribution of waiting days by gender and no-show status.

Longer wait times are more common among no-show patients, particularly among males.

Table 11.

Categorical feature distributions.

The complete categorical distribution table is provided in S1 Table.

Fig 6.

Missing value matrix confirming complete data coverage across all key features.

Fig 7.

Unique values per feature.

Fig 8.

Heatmap of appointment counts across neighborhoods.

Fig 9.

Scheduled appointments over time.

Fig 10.

Appointment dates distribution.

Fig 11.

Age ECDF distribution.

Fig 12.

Waiting days distribution.

Fig 13.

Correlation among encoded features.

The feature encoding map used to generate the encoded matrix is provided in S2 Table.

Fig 14.

Missingness after engineering.

The full missingness report is provided in S3 Table.

Fig 15.

Scaled numeric features (violin plot).

Fig 16.

Confusion matrix of predictions.

Table 12.

Class distribution and support counts for the fine-tuned LLaMA 7B model.

Fig 17.

Precision–recall curve of the logistic regression classifier (AP = 0.87).

Fig 18.

ROC curve with AUC = 0.65.

Fig 19.

Explained variance per PCA component.

Fig 20.

t-SNE visualization of LLaMA embeddings.

The ‘Show’ and ‘No-show’ classes separate visibly but not cleanly, indicating potential for further tuning.

Fig 21.

SHAP dependence plot for age.

Fig 22.

SHAP summary plot showing global feature importance and individual SHAP value distribution.

Fig 23.

SHAP interaction plot between waiting_days and age.

Table 13.

Top-ranked features by mean SHAP value (importance).

The complete feature ranking is provided in S4 Table.

Fig 24.

Violin plots by show-up status.

Fig 25.

PCA scatterplot of features.

Fig 26.

Feature-wise memory usage.
