Classification of Medical Diagnoses based on Symptoms
Using Text Classification via HuggingFace 🤗 to Predict Medical Diagnoses based on Symptoms
DISCLAIMER:
I am not a medical doctor. This is an exercise in LLM text classification and should not be used for diagnostic purposes.
Nothing in this exercise should be taken as medical advice. Please consult your physician for actual medical advice.
I do not own the data used in this exercise and I was not involved in its collection.
Introduction
Classification of text data using LLMs is extremely useful as it makes it possible to convert natural language into a quantifiable and predictive format. There are countless use cases for text classification, and in this article I will be applying classification to medical diagnostic data.
The goal is to predict a "diagnosis" based on an input of symptoms.
The following libraries are used for this exercise. For specific instructions on using the latest PyTorch libraries see here.
$ pip install -U transformers datasets accelerate evaluate scikit-learn
$ pip install -U torch torchvision torchaudio
$ pip install -U polars
The Data
The first step is to pull the dataset from the HuggingFace Datasets hub. The dataset used in this exercise was published on the hub by user mohammad2928git and can be found here.
Data is loaded using the Polars library. The fields used are text and lebel_text (spelled as in the code below); the first element of each lebel_text entry is kept and renamed to label. Only observations where the text field is not null are kept:
import polars as pl
data_scan = (
    pl.scan_parquet("hf://datasets/mohammad2928git/complete_medical_symptom_dataset/data/train-*.parquet")
    .filter(
        pl.col("text").is_not_null()
    )
    .select(
        pl.col("text"),
        pl.col("lebel_text").map_elements(lambda x: x[0], return_dtype = pl.String).alias("label")
    )
)
data_polars = data_scan.collect()
print(data_polars.head(3))
shape: (3, 2)
┌─────────────────────────────────┬───────────────────────────┐
│ text                            ┆ label                     │
│ ---                             ┆ ---                       │
│ str                             ┆ str                       │
╞═════════════════════════════════╪═══════════════════════════╡
│ I have been having migraines a… ┆ drug reaction             │
│ I have asthma and I get wheezi… ┆ allergy                   │
│ Signs and symptoms of primary … ┆ premature ovarian failure │
└─────────────────────────────────┴───────────────────────────┘
For this exercise, and to keep training compute time to a minimum, the data is cut down to the top 10 most-common labels. This results in a data set of 1,706 unique observations:
top_ten_labels = (
    data_polars
    .group_by("label")
    .agg(
        pl.len().alias("n")
    )
    .sort("n", descending = True)
    .head(10)
    .select("label")
    .to_series()
    .to_list()
)
top_ten_labels
['psoriasis',
'arthritis',
'cervical spondylosis',
'impetigo',
'malaria',
'varicose veins',
'allergy',
'bronchial asthma',
'dengue',
'chicken pox']
data_polars = (
    data_polars
    .filter(
        pl.col("label").is_in(top_ten_labels)
    )
)
data_polars.shape
(1706, 2)
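For readers less familiar with Polars, the same "most common labels" selection can be sketched in plain Python with collections.Counter. This is only an illustration: the sample labels below are made up rather than drawn from the dataset, and top_n_labels is a hypothetical helper name.

```python
from collections import Counter

def top_n_labels(labels, n=10):
    """Return the n most frequent labels, most common first."""
    return [label for label, _count in Counter(labels).most_common(n)]

# Made-up sample labels, for illustration only.
sample_labels = ["allergy", "malaria", "allergy", "dengue", "allergy", "malaria"]
top_two = top_n_labels(sample_labels, n=2)
print(top_two)  # ['allergy', 'malaria']
```
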
An important step for training is to create a label mapper which will be used to map the labels to corresponding int values:
label_mapper = dict()
for i, label in enumerate(top_ten_labels):
    label_mapper[label] = i
label_mapper
{'psoriasis': 0,
'arthritis': 1,
'cervical spondylosis': 2,
'impetigo': 3,
'malaria': 4,
'varicose veins': 5,
'allergy': 6,
'bronchial asthma': 7,
'dengue': 8,
'chicken pox': 9}
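The same mapping can also be written as a dict comprehension, together with its inverse. The sketch below is a minimal, self-contained version; the names label2id and id2label are my own choice here, picked because transformers model configs accept keyword arguments with those names, which is an option this article does not use.

```python
# Labels in the order produced by the top-ten step above.
top_ten_labels = [
    "psoriasis", "arthritis", "cervical spondylosis", "impetigo", "malaria",
    "varicose veins", "allergy", "bronchial asthma", "dengue", "chicken pox",
]

# Forward mapping (text -> int), equivalent to the loop above.
label2id = {label: i for i, label in enumerate(top_ten_labels)}

# Inverse mapping (int -> text), useful for decoding predictions later.
id2label = {i: label for label, i in label2id.items()}

print(label2id["malaria"], id2label[4])  # 4 malaria
```
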
The data is then mapped with the label mapper, using the replace_strict() method from Polars:
data_polars = (
    data_polars
    .select(
        pl.col("text"),
        pl.col("label").replace_strict(label_mapper, return_dtype = pl.Int32)
    )
)
print(data_polars.head(3))
shape: (3, 2)
┌─────────────────────────────────┬───────┐
│ text                            ┆ label │
│ ---                             ┆ ---   │
│ str                             ┆ i32   │
╞═════════════════════════════════╪═══════╡
│ I have asthma and I get wheezi… ┆ 6     │
│ cough,high_fever,breathlessnes… ┆ 7     │
│ chills,vomiting,high_fever,swe… ┆ 4     │
└─────────────────────────────────┴───────┘
Personally, I like saving data to local storage at intermediate steps. This is purely optional, and it assumes that a local data directory already exists.
data_polars.write_csv("data/data_polars.csv")
Next, split the data set into Train and Test sets:
import math
def split_train_test(pdf, train_frac = 0.8):
    train_size = math.floor(pdf.shape[0] * train_frac)
    pdf = pdf.sample(fraction = 1, shuffle = True)
    train, test = pdf.head(train_size), pdf.tail(-train_size)
    return train, test
data_train, data_test = split_train_test(data_polars)
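Because no seed is passed, the split above is reshuffled on every run. Below is a minimal sketch of a reproducible version of the same head/tail logic, written in plain Python over row indices. The function name split_indices and its seed parameter are hypothetical and not part of Polars.

```python
import math
import random

def split_indices(n_rows, train_frac=0.8, seed=42):
    """Shuffle row indices with a fixed seed, then cut into train and test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)          # deterministic for a given seed
    train_size = math.floor(n_rows * train_frac)  # same rounding as split_train_test
    return indices[:train_size], indices[train_size:]

train_idx, test_idx = split_indices(1706)
print(len(train_idx), len(test_idx))  # 1364 342
```

In Polars itself, a similar effect should be achievable by passing a seed argument to DataFrame.sample().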
Again, saving these splits as local objects in a local data directory is a good idea:
data_train.write_csv("data/data_train.csv")
data_test.write_csv("data/data_test.csv")
Next, the data is converted from a Polars object to a HuggingFace DatasetDict object to be used for the remainder of this exercise.
Here I am using the load_dataset() function from the HuggingFace datasets library, as this is the preferred method of loading datasets from the HF Datasets hub. The only difference is that since the splits are locally stored from the previous step, their local paths will be provided as input (as opposed to a HF url):
from datasets import load_dataset
ds_dict_paths = {"train": "data/data_train.csv", "test": "data/data_test.csv"}
ds_dict = load_dataset("csv", data_files = ds_dict_paths)
ds_dict
Generating train split: 1364 examples [00:00, 50494.53 examples/s]
Generating test split: 342 examples [00:00, 84728.41 examples/s]
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 1364
})
test: Dataset({
features: ['text', 'label'],
num_rows: 342
})
})
Finally, split the Train set further into a Train & Validation set, which will be used in finetuning:
## Split:
ds_ttv = ds_dict["train"].train_test_split(train_size = 0.8, seed = 42)
## Rename the newly-created "test" to "validation":
ds_ttv["validation"] = ds_ttv.pop("test")
## Add the original "test" set to the new dataset:
ds_ttv["test"] = ds_dict["test"]
## No longer need original:
del ds_dict
ds_ttv
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 1091
})
validation: Dataset({
features: ['text', 'label'],
num_rows: 273
})
test: Dataset({
features: ['text', 'label'],
num_rows: 342
})
})
Save the Train / Test / Validation set to disk:
ds_ttv.save_to_disk("data/ds_ttv")
This creates the following data structure inside your data directory:
ds_ttv/
├── dataset_dict.json
├── test
│   ├── dataset.arrow
│   ├── dataset_info.json
│   └── state.json
├── train
│   ├── dataset.arrow
│   ├── dataset_info.json
│   ├── indices.arrow
│   └── state.json
└── validation
    ├── dataset.arrow
    ├── dataset_info.json
    ├── indices.arrow
    └── state.json
With the data processing steps complete, and the data now available to be loaded using the load_from_disk() function, we move on to the Tokenization step.
Tokenization
Text data needs to be converted to a numeric format so that it can be used by the model, and Tokenization is the process by which this is accomplished. For a detailed look at HF's tokenizer implementations, refer to their docs.
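To make the idea concrete, here is a deliberately simplified sketch of what "text to numbers" means, including truncation, padding and an attention mask. It is not how the BERT tokenizer works internally (that one uses WordPiece subwords and adds special tokens); the toy_tokenize function and its five-entry vocabulary are hypothetical.

```python
def toy_tokenize(text, vocab, max_length=8):
    """Map whitespace-separated words to ids, then truncate and pad."""
    ids = [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]
    ids = ids[:max_length]                                  # truncation
    mask = [1] * len(ids) + [0] * (max_length - len(ids))   # 1 = real token, 0 = padding
    ids = ids + [vocab["[PAD]"]] * (max_length - len(ids))  # padding
    return {"input_ids": ids, "attention_mask": mask}

# Hypothetical vocabulary, for illustration only.
vocab = {"[PAD]": 0, "[UNK]": 1, "i": 2, "have": 3, "fever": 4}
encoded = toy_tokenize("I have fever today", vocab, max_length=6)
print(encoded)
# {'input_ids': [2, 3, 4, 1, 0, 0], 'attention_mask': [1, 1, 1, 1, 0, 0]}
```
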
First, load the dataset split from the last section. This can be done using the load_from_disk() function from the HuggingFace datasets library:
import torch
from datasets import load_from_disk
ds_ttv = load_from_disk("data/ds_ttv")
ds_ttv
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 1091
})
validation: Dataset({
features: ['text', 'label'],
num_rows: 273
})
test: Dataset({
features: ['text', 'label'],
num_rows: 342
})
})
Next, define the tokenizer and the function that will be used to tokenize the data. The tokenizer is loaded with AutoTokenizer.from_pretrained() from the HuggingFace transformers library, and here I am using the tokenizer of the "google-bert/bert-base-cased" model. The map() method then applies tokenizer_function() to the Train / Test / Validation sets:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
def tokenizer_function(ds, text_column = "text"):
    return tokenizer(ds[text_column], padding = "max_length", truncation = True)
tokenized_ds_ttv = ds_ttv.map(tokenizer_function, batched = True)
tokenized_ds_ttv
Map: 100%|██████████| 1091/1091 [00:00<00:00, 8037.90 examples/s]
Map: 100%|██████████| 273/273 [00:00<00:00, 7628.45 examples/s]
Map: 100%|██████████| 342/342 [00:00<00:00, 8940.35 examples/s]
DatasetDict({
train: Dataset({
features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 1091
})
validation: Dataset({
features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 273
})
test: Dataset({
features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 342
})
})
Finally, a data collator is used to assemble samples into batches. It is created from the same tokenizer as in the previous step and is passed to the model Trainer in a subsequent step. The collator is created using HF's DataCollatorWithPadding class, which also applies the necessary padding to each batch:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)
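The idea behind per-batch padding can be sketched in a few lines of plain Python: every sequence in a batch is padded to the length of the longest sequence in that batch. This is an illustration of the concept rather than the library's implementation, and pad_batch is a hypothetical name. Note that since tokenizer_function above already pads everything to the maximum length, the collator has little padding left to do in this particular exercise.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids of different lengths.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```
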
Training
Fine-tuning the model via Training requires a few steps.
The starting model used here is the same "google-bert/bert-base-cased" model whose tokenizer was used in the tokenization step above. It is loaded with a sequence classification head using HF's AutoModelForSequenceClassification.from_pretrained() method. It is important to set the correct number of labels present in the data; for this exercise num_labels = 10.
The TrainingArguments class is used to define the training arguments:
from transformers import AutoModelForSequenceClassification, TrainingArguments
model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels = 10)
training_args = TrainingArguments(
    output_dir = "checkpoints",
    eval_strategy = "epoch"
)
It is also useful to define a compute_metrics() function for reporting during training:
import evaluate
import numpy as np
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis = -1)
    return metric.compute(predictions = predictions, references = labels)
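What compute_metrics() does can be spelled out without any libraries: take the index of the largest logit in each row as the predicted class, then count how often it matches the true label. The sketch below uses made-up logits and hypothetical helper names (argmax, accuracy).

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up logits for three examples and three classes.
logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9]]
labels = [1, 0, 1]
score = accuracy(logits, labels)
print(score)  # 2 of 3 predictions match, so about 0.667
```
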
The last step is to run the actual training. For this purpose, HF's Trainer class is used, and its .train() method is called. Note that the Validation set from an earlier step is passed as the evaluation set.
The Validation Loss and Accuracy for each of the three epochs are reported during the training process.
from transformers import Trainer
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_ds_ttv["train"],
    eval_dataset = tokenized_ds_ttv["validation"],
    compute_metrics = compute_metrics,
    data_collator = data_collator
)
trainer.train()
[411/411 04:05, Epoch 3/3]
Epoch  Training Loss  Validation Loss  Accuracy
1      No log         0.185095         0.978022
2      No log         0.063872         0.985348
3      No log         0.038519         0.989011
TrainOutput(global_step=411, training_loss=0.48913615057358195, metrics={'train_runtime': 246.8817, 'train_samples_per_second': 13.257, 'train_steps_per_second': 1.665, 'total_flos': 861224340436992.0, 'train_loss': 0.48913615057358195, 'epoch': 3.0})
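The step count in this output can be checked with a little arithmetic. The sketch below assumes the TrainingArguments defaults (a per-device batch size of 8 and 3 epochs) and a single device, since none of these were set explicitly above.

```python
import math

train_rows = 1091  # size of the train split shown earlier
batch_size = 8     # assumed: default per-device train batch size
epochs = 3         # assumed: default number of training epochs

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 137 411
```

The result matches global_step=411. It would also explain the "No log" entries in the Training Loss column, assuming the default logging interval of 500 steps, which is longer than the whole run.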
Once training is completed, save the model to a local models directory to be used in subsequent steps:
trainer.save_model("models/google-base-cased-tuned-on-medical-text-top-ten")
Using the Trained Model
At this point we have a trained classifier model, stored in a local path at "models/google-base-cased-tuned-on-medical-text-top-ten". This model can now be used to perform classification on new data, which was not used to train the model.
NOTE: At this stage I restart my kernel to simulate a real-world scenario where your starting point is a tuned model. This is an optional step, but it is a good sanity check.
First, load the model using the same AutoModelForSequenceClassification() function, but this time make sure to provide the local path to the trained model.
As before, it is important to define the correct number of labels present in the data:
from transformers import AutoModelForSequenceClassification
tuned_model = AutoModelForSequenceClassification.from_pretrained(
    "models/google-base-cased-tuned-on-medical-text-top-ten",
    num_labels = 10
)
Next, load the new data (Test Data) that was created in the first section of this exercise. Again, this is to simulate a real-world scenario in which you are interested in running a prediction on some new data.
import polars as pl
test_data = pl.read_csv("data/data_test.csv")
print(test_data.head())
shape: (5, 2)
┌─────────────────────────────────┬───────┐
│ text                            ┆ label │
│ ---                             ┆ ---   │
│ str                             ┆ i64   │
╞═════════════════════════════════╪═══════╡
│ I have prominent blood vessels… ┆ 5     │
│ continuous_sneezing,shivering,… ┆ 6     │
│ chills,vomiting,high_fever,swe… ┆ 4     │
│ itching,skin_rash,fatigue,leth… ┆ 9     │
│ fatigue,cramps,bruising,obesit… ┆ 5     │
└─────────────────────────────────┴───────┘
Tokenize this new data using the same tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
def make_token_list(text_list, use_tokenizer):
    return [use_tokenizer(t, truncation = True, is_split_into_words = False, return_tensors = "pt") for t in text_list]
test_data_tokenized_list = make_token_list(
    text_list = test_data.select(pl.col("text")).to_series().to_list(),
    use_tokenizer = tokenizer
)
Finally, run classification on this new data:
import numpy as np
def predict_token_list(mod, token_list):
    return [int(mod(t["input_ids"]).logits.argmax(-1)[0]) for t in token_list]
preds = predict_token_list(
    mod = tuned_model,
    token_list = test_data_tokenized_list
)
There are various methods for evaluating predictions. For the sake of brevity, in this exercise I only compute the fraction of test labels the model misclassified, from which the fraction classified correctly follows directly. Of course, more stringent methods of evaluation can be applied.
test_data_with_preds = test_data.with_columns(pl.Series(name = "predicted_label", values = preds, dtype = pl.Int8))
print(test_data_with_preds.head())
shape: (5, 3)
┌─────────────────────────────────┬───────┬─────────────────┐
│ text                            ┆ label ┆ predicted_label │
│ ---                             ┆ ---   ┆ ---             │
│ str                             ┆ i64   ┆ i8              │
╞═════════════════════════════════╪═══════╪═════════════════╡
│ I have prominent blood vessels… ┆ 5     ┆ 5               │
│ continuous_sneezing,shivering,… ┆ 6     ┆ 6               │
│ chills,vomiting,high_fever,swe… ┆ 4     ┆ 4               │
│ itching,skin_rash,fatigue,leth… ┆ 9     ┆ 9               │
│ fatigue,cramps,bruising,obesit… ┆ 5     ┆ 5               │
└─────────────────────────────────┴───────┴─────────────────┘
test_data_with_preds.filter(pl.col("label") != pl.col("predicted_label")).shape[0] / test_data_with_preds.shape[0]
0.02046783625730994
With about 2% of the test labels misclassified, our model predicted the correct label with ~98% Accuracy.
Not bad!
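As one example of a slightly more detailed evaluation, recall can be computed per class, which shows whether the errors are concentrated in particular diagnoses. The sketch below is plain Python with made-up labels and predictions (not the results above), and per_class_recall is a hypothetical helper name.

```python
from collections import Counter

def per_class_recall(labels, preds):
    """Recall per class: correct predictions divided by true occurrences."""
    totals = Counter(labels)
    hits = Counter(y for y, p in zip(labels, preds) if y == p)
    return {cls: hits[cls] / totals[cls] for cls in sorted(totals)}

# Made-up labels and predictions, for illustration only.
labels = [5, 6, 4, 9, 5, 6]
preds = [5, 6, 4, 9, 5, 4]
recall = per_class_recall(labels, preds)
print(recall)  # {4: 1.0, 5: 1.0, 6: 0.5, 9: 1.0}
```
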
Predictions using Pipeline
Another application of the trained model is to parse symptoms one by one, and retrieve a "diagnosis" on a case-by-case basis. HF's pipeline() function is very useful for this task.
First, let's bring back the label mapper and create an inverse mapper from it. This will be used to convert the numeric label that the model outputs to its matching text diagnosis:
label_mapper = {
'psoriasis': 0,
'arthritis': 1,
'cervical spondylosis': 2,
'impetigo': 3,
'malaria': 4,
'varicose veins': 5,
'allergy': 6,
'bronchial asthma': 7,
'dengue': 8,
'chicken pox': 9
}
label_mapper_inverse = dict()
for k, v in label_mapper.items():
    label_mapper_inverse[v] = k
label_mapper_inverse
{0: 'psoriasis',
1: 'arthritis',
2: 'cervical spondylosis',
3: 'impetigo',
4: 'malaria',
5: 'varicose veins',
6: 'allergy',
7: 'bronchial asthma',
8: 'dengue',
9: 'chicken pox'}
Next, import HF's pipeline() and use it to define a pipe for our purpose. We will also create a diagnose_symptoms() function to drive the process.
from transformers import pipeline
pipe = pipeline(
    task = "text-classification",
    model = tuned_model,
    tokenizer = tokenizer, # tokenizer defined above
    device = "cuda" # to run on GPU
)
def diagnose_symptoms(symptom):
    pred = pipe(symptom)
    pred_int = int(pred[0]["label"].split("_")[1])
    pred_diagnosis = label_mapper_inverse[pred_int]
    pred_score = np.round(pred[0]["score"], 3)
    return f"Diagnosis: {pred_diagnosis} (score: {pred_score})"
Now, we can use the diagnose_symptoms() function to return a "diagnosis" based on symptoms (a classification score is also returned):
diagnose_symptoms("I have a skin rash. It is blotchy.")
'Diagnosis: impetigo (score: 0.677)'
diagnose_symptoms("My head hurts and I am coughing and sneezing")
'Diagnosis: allergy (score: 0.993)'
diagnose_symptoms("My neck hurts. I have back pain. I am constantly dizzy.")
'Diagnosis: cervical spondylosis (score: 0.996)'
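For intuition about where the label and score come from: as far as I understand, for a single-label classifier like this one the pipeline applies a softmax to the model's logits and reports the highest probability as the score, with default label names of the form "LABEL_3". The sketch below illustrates both pieces in plain Python with made-up logits; softmax and parse_label are hypothetical helper names, not pipeline internals.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def parse_label(label):
    """Extract the integer id from a default label name such as 'LABEL_3'."""
    return int(label.split("_")[1])

probs = softmax([2.0, 0.5, 0.1])  # made-up logits
print(parse_label("LABEL_3"), round(max(probs), 3))
```
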
Summary
In this article I used HuggingFace 🤗 to finetune an existing text classification model on medical diagnoses based on text symptoms. Obviously, the dataset I used was greatly truncated for the sake of computation time, but this exercise nevertheless showcases the vast potential of text classification methods in various fields, including medicine.
As the scope of LLMs expands, so will their adoption across a wide spectrum of professional applications. The goal should therefore be to maximize the benefits of this adoption through continuous development.
Thanks for reading!