Classification of Medical Diagnoses based on Symptoms

Using Text Classification via HuggingFace πŸ€— to Predict Medical Diagnoses based on Symptoms
October 13, 2024

DISCLAIMER:

I am not a medical doctor. This is an exercise in LLM text classification and should not be used for diagnostic purposes.

Nothing in this exercise should be taken as medical advice. Please consult your physician for actual medical advice.

I do not own the data used in this exercise and I was not involved in its collection.



Introduction

Text classification with LLMs is extremely useful, as it makes it possible to convert natural language into a quantifiable, predictive format. There are countless use cases for text classification, and in this article I will be applying it to medical diagnostic data.

The goal is to predict a "diagnosis" based on an input of symptoms.

The following libraries are used for this exercise. For specific instructions on using the latest PyTorch libraries see here.


$ pip install -U transformers datasets accelerate

$ pip install -U torch torchvision torchaudio

$ pip install -U polars
        

The Data

The first step is to pull the dataset from the HuggingFace Datasets hub. The dataset used in this exercise was uploaded by the user mohammad2928git and can be found here.

Data is loaded using the Polars library, and the fields used will be text and label. Only observations where the text field is not null are kept:


import polars as pl

data_scan = (
    pl.scan_parquet("hf://datasets/mohammad2928git/complete_medical_symptom_dataset/data/train-*.parquet")
    .filter(
        pl.col("text").is_not_null()
    )
    .select(
        pl.col("text"),
        ## The label column holds a list; take its first element as the label.
        ## ("lebel_text" is the column's spelling in the source dataset.)
        pl.col("lebel_text").map_elements(lambda x: x[0], return_dtype = pl.String).alias("label")
    )
)

data_polars = data_scan.collect()

print(data_polars.head(3))
        

shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ text                            ┆ label                     β”‚
β”‚ ---                             ┆ ---                       β”‚
β”‚ str                             ┆ str                       β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═══════════════════════════║
β”‚ I have been having migraines a… ┆ drug reaction             β”‚
β”‚ I have asthma and I get wheezi… ┆ allergy                   β”‚
β”‚ Signs and symptoms of primary … ┆ premature ovarian failure β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        

For this exercise, and to keep training compute time to a minimum, the data is cut down to the top 10 most common labels. This results in a dataset of 1,706 observations:


top_ten_labels = (
    data_polars
    .group_by("label")
    .agg(
        pl.len().alias("n")
    )
    .sort("n", descending = True)
    .head(10)
    .select("label")
    .to_series()
    .to_list()
)

top_ten_labels
        

['psoriasis',
 'arthritis',
 'cervical spondylosis',
 'impetigo',
 'malaria',
 'varicose veins',
 'allergy',
 'bronchial asthma',
 'dengue',
 'chicken pox']
        

data_polars = (
    data_polars
    .filter(
        pl.col("label").is_in(top_ten_labels)
    )
)

data_polars.shape
        

(1706, 2)
        

An important step for training is to create a label mapper which will be used to map the labels to corresponding int values:


label_mapper = {label: i for i, label in enumerate(top_ten_labels)}

label_mapper
        

{'psoriasis': 0,
 'arthritis': 1,
 'cervical spondylosis': 2,
 'impetigo': 3,
 'malaria': 4,
 'varicose veins': 5,
 'allergy': 6,
 'bronchial asthma': 7,
 'dengue': 8,
 'chicken pox': 9}
        

The data is then mapped with the label mapper, using the replace_strict() method from Polars:


data_polars = (
    data_polars
    .select(
        pl.col("text"),
        pl.col("label").replace_strict(label_mapper, return_dtype = pl.Int32)
    )
)

print(data_polars.head(3))
        

shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ text                            ┆ label β”‚
β”‚ ---                             ┆ ---   β”‚
β”‚ str                             ┆ i32   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═══════║
β”‚ I have asthma and I get wheezi… ┆ 6     β”‚
β”‚ cough,high_fever,breathlessnes… ┆ 7     β”‚
β”‚ chills,vomiting,high_fever,swe… ┆ 4     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
        

Personally, I like saving data to local storage at key checkpoints along the way. This is purely optional.


data_polars.write_csv("data/data_polars.csv")
        

Next, split the data set into Train and Test sets:


import math

def split_train_test(pdf, train_frac = 0.8):
    ## Shuffle the full dataset, then take the first 80% as Train and
    ## the remainder as Test. (Pass seed = ... to sample() if you want
    ## a reproducible split.)
    train_size = math.floor(pdf.shape[0] * train_frac)
    pdf = pdf.sample(fraction = 1, shuffle = True)
    train, test = pdf.head(train_size), pdf.tail(-train_size)
    return train, test

data_train, data_test = split_train_test(data_polars)
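
A quick shape check confirms the 80/20 split (1,364 / 342 observations, matching the totals reported by load_dataset() below):


print(data_train.shape, data_test.shape)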
        

Again, saving these splits to a local data directory is a good idea:


data_train.write_csv("data/data_train.csv")

data_test.write_csv("data/data_test.csv")
        

Next, the data is converted from a Polars object to a HuggingFace DatasetDict object to be used for the remainder of this exercise.

Here I am using the load_dataset() function from the HuggingFace datasets library, as this is the preferred method of loading datasets from the HF Datasets hub. The only difference is that, since the splits were stored locally in the previous step, their local paths are provided as input (as opposed to an HF URL):


from datasets import load_dataset

ds_dict_paths = {"train": "data/data_train.csv", "test": "data/data_test.csv"}

ds_dict = load_dataset("csv", data_files = ds_dict_paths)

ds_dict
        

Generating train split: 1364 examples [00:00, 50494.53 examples/s]
Generating test split: 342 examples [00:00, 84728.41 examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1364
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 342
    })
})
        

Finally, split the Train set further into Train and Validation sets, which will be used during fine-tuning:


## Split:
ds_ttv = ds_dict["train"].train_test_split(train_size = 0.8, seed = 42)

## Rename the newly-created "test" to "validation":
ds_ttv["validation"] = ds_ttv.pop("test")

## Add the original "test" set to the new dataset:
ds_ttv["test"] = ds_dict["test"]

## No longer need original:
del ds_dict

ds_ttv
        

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1091
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 273
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 342
    })
})
        

Save the Train / Test / Validation set to disk:


ds_ttv.save_to_disk("data/ds_ttv")
        

This creates the following data structure inside your data directory:


ds_ttv/
β”œβ”€β”€ dataset_dict.json
β”œβ”€β”€ test
β”‚   β”œβ”€β”€ dataset.arrow
β”‚   β”œβ”€β”€ dataset_info.json
β”‚   └── state.json
β”œβ”€β”€ train
β”‚   β”œβ”€β”€ dataset.arrow
β”‚   β”œβ”€β”€ dataset_info.json
β”‚   β”œβ”€β”€ indices.arrow
β”‚   └── state.json
└── validation
    β”œβ”€β”€ dataset.arrow
    β”œβ”€β”€ dataset_info.json
    β”œβ”€β”€ indices.arrow
    └── state.json
        

With the data processing steps complete, and the data now available to be loaded using the load_from_disk() function, we move on to the Tokenization step.


Tokenization

Text data needs to be converted to a numeric format so that it can be used by the model, and Tokenization is the process by which this is accomplished. For a detailed look at HF's tokenizer implementations, refer to their docs.
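
To make this concrete, here is a minimal sketch (separate from the pipeline below) of what a tokenizer produces for a single string: integer ids into the model's vocabulary, plus the subword tokens those ids correspond to. The names tok and encoding are illustrative only:


from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

encoding = tok("I have a skin rash.")

## Integer ids (including special tokens like [CLS] and [SEP]):
print(encoding["input_ids"])

## The subword tokens those ids map back to:
print(tok.convert_ids_to_tokens(encoding["input_ids"]))
        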

First, load the dataset split from the last section. This can be done using the load_from_disk() function from the HuggingFace datasets library:


import torch

from datasets import load_from_disk

ds_ttv = load_from_disk("data/ds_ttv")

ds_ttv
        

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1091
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 273
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 342
    })
})
        

Next, define the tokenizer and the function that will be used to tokenize the data. The tokenizer is loaded via the AutoTokenizer class from the HuggingFace transformers library, and here I am using the "google-bert/bert-base-cased" checkpoint. Additionally, the map() method is used to apply tokenizer_function() to the Train / Test / Validation sets:


from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def tokenizer_function(ds, text_column = "text"):
    return tokenizer(ds[text_column], padding = "max_length", truncation = True)

tokenized_ds_ttv = ds_ttv.map(tokenizer_function, batched = True)

tokenized_ds_ttv
        

Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1091/1091 [00:00<00:00, 8037.90 examples/s]
Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 273/273 [00:00<00:00, 7628.45 examples/s]
Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 342/342 [00:00<00:00, 8940.35 examples/s]


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1091
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 273
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 342
    })
})
        

Finally, a collator function is used to create batched samples. It is built from the same tokenizer as the previous step, and will be passed to the model Trainer in a subsequent step. The Data Collator is created using HF's DataCollatorWithPadding() class, which also applies the necessary padding to each batch. (Note that because the examples above were already padded to max_length during tokenization, the collator's dynamic padding is effectively a no-op here; tokenizing without padding and letting the collator pad each batch to its longest sequence would produce smaller batches.)


from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer = tokenizer)
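
As an optional sanity check (a minimal sketch, assuming the tokenized_ds_ttv object from above), the collator can be called directly on a handful of examples. The raw "text" column must be stripped first, since the collator only handles fields it can convert to tensors:


features = [
    {k: v for k, v in ex.items() if k != "text"}
    for ex in tokenized_ds_ttv["train"].select(range(2))
]

batch = data_collator(features)

## Padded (batch_size, sequence_length) tensor of token ids:
print(batch["input_ids"].shape)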
        

Training

Fine-tuning the model via Training requires a few steps.

The starting model is the same "google-bert/bert-base-cased" checkpoint that was used in the tokenization step above. Loading it with HF's AutoModelForSequenceClassification() class attaches a (randomly initialized) classification head on top of the base model, and this head is what gets fine-tuned. It is important to define the correct number of labels present in the data; for this exercise, num_labels = 10.

The TrainingArguments() class is used to define the training arguments.


from transformers import AutoModelForSequenceClassification, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels = 10)

training_args = TrainingArguments(
    output_dir = "checkpoints",
    eval_strategy = "epoch"
)
        

It is also useful to define a compute_metrics() function for reporting during training:


import evaluate
import numpy as np

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis = -1)
    return metric.compute(predictions = predictions, references = labels)
        

The last step is to run the actual Training. For this purpose, the Trainer() class from HF is used, and training is started with its .train() method. Note that the Validation set from an earlier step is passed as the evaluation set.

The Validation Loss and Accuracy for each of the three epochs are reported during the training process.

NOTE: On my machine this step takes ~4 minutes

from transformers import Trainer

trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_ds_ttv["train"],
    eval_dataset = tokenized_ds_ttv["validation"],
    compute_metrics = compute_metrics,
    data_collator = data_collator
)

trainer.train()
        

[411/411 04:05, Epoch 3/3]

Epoch    Training Loss    Validation Loss    Accuracy
1        No log           0.185095           0.978022
2        No log           0.063872           0.985348
3        No log           0.038519           0.989011

TrainOutput(global_step=411, training_loss=0.48913615057358195, metrics={'train_runtime': 246.8817, 'train_samples_per_second': 13.257, 'train_steps_per_second': 1.665, 'total_flos': 861224340436992.0, 'train_loss': 0.48913615057358195, 'epoch': 3.0})
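
Before saving, the same Trainer can also score the held-out Test split directly (optional; given the validation numbers above, test accuracy should land in the same neighborhood):


## Optional: evaluate on the untouched Test split with the same metric:
test_metrics = trainer.evaluate(tokenized_ds_ttv["test"])

print(test_metrics)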
        

Once training is completed, save the model to a local models directory to be used in subsequent steps:


trainer.save_model("models/google-base-cased-tuned-on-medical-text-top-ten")
        

Using the Trained Model

At this point we have a trained classifier model, stored in a local path at "models/google-base-cased-tuned-on-medical-text-top-ten". This model can now be used to perform classification on new data, which was not used to train the model.

NOTE: At this stage I restart my kernel to simulate a real-world scenario where your starting point is a tuned model. This is an optional step, but it is a good sanity check.

First, load the model using the same AutoModelForSequenceClassification() class, but this time provide the local path to the trained model.

As before, it is important to define the correct number of labels present in the data:


from transformers import AutoModelForSequenceClassification

tuned_model = AutoModelForSequenceClassification.from_pretrained(
    "models/google-base-cased-tuned-on-medical-text-top-ten",
    num_labels = 10
)
        

Next, load the new data (Test Data) that was created in the first section of this exercise. Again, this is to simulate a real-world scenario in which you are interested in running a prediction on some new data.


import polars as pl

test_data = pl.read_csv("data/data_test.csv")

print(test_data.head())
        

shape: (5, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ text                            ┆ label β”‚
β”‚ ---                             ┆ ---   β”‚
β”‚ str                             ┆ i64   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═══════║
β”‚ I have prominent blood vessels… ┆ 5     β”‚
β”‚ continuous_sneezing,shivering,… ┆ 6     β”‚
β”‚ chills,vomiting,high_fever,swe… ┆ 4     β”‚
β”‚ itching,skin_rash,fatigue,leth… ┆ 9     β”‚
β”‚ fatigue,cramps,bruising,obesit… ┆ 5     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
        

Tokenize this new data using the same tokenizer:


from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def make_token_list(text_list, use_tokenizer):
    ## Tokenize each text individually, returning PyTorch tensors:
    return [use_tokenizer(t, truncation = True, is_split_into_words = False, return_tensors = "pt") for t in text_list]

test_data_tokenized_list = make_token_list(
    text_list = test_data.select(pl.col("text")).to_series().to_list(),
    use_tokenizer = tokenizer
)
        

Finally, run classification on this new data:


import numpy as np
import torch

## Run each tokenized example through the model (gradients are not
## needed for inference) and take the highest-scoring class:
def predict_token_list(mod, token_list):
    with torch.no_grad():
        return [int(mod(t["input_ids"]).logits.argmax(-1)[0]) for t in token_list]

preds = predict_token_list(
    mod = tuned_model,
    token_list = test_data_tokenized_list
)
        

There are various ways to evaluate the predictions. For the sake of brevity, I am just looking at the fraction of labels classified correctly by the model. Of course, more rigorous evaluation methods can be applied (a per-label breakdown is sketched at the end of this section).


test_data_with_preds = test_data.with_columns(pl.Series(name = "predicted_label", values = preds, dtype = pl.Int8))

print(test_data_with_preds.head())
        

shape: (5, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ text                            ┆ label ┆ predicted_label β”‚
β”‚ ---                             ┆ ---   ┆ ---             β”‚
β”‚ str                             ┆ i64   ┆ i8              β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═══════β•ͺ═════════════════║
β”‚ I have prominent blood vessels… ┆ 5     ┆ 5               β”‚
β”‚ continuous_sneezing,shivering,… ┆ 6     ┆ 6               β”‚
β”‚ chills,vomiting,high_fever,swe… ┆ 4     ┆ 4               β”‚
β”‚ itching,skin_rash,fatigue,leth… ┆ 9     ┆ 9               β”‚
β”‚ fatigue,cramps,bruising,obesit… ┆ 5     ┆ 5               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        

## Fraction of observations where the prediction missed the true label:
test_data_with_preds.filter(pl.col("label") != pl.col("predicted_label")).shape[0] / test_data_with_preds.shape[0]
        

0.02046783625730994
        

The misclassification rate is ~2%; in other words, the model predicted the correct label with ~98% accuracy.

Not bad!
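
As promised above, here is one example of a more detailed evaluation: a per-label accuracy breakdown, computed with Polars from the same test_data_with_preds frame (a minimal sketch; per_label_accuracy is an illustrative name):


per_label_accuracy = (
    test_data_with_preds
    .group_by("label")
    .agg(
        (pl.col("label") == pl.col("predicted_label")).mean().alias("accuracy"),
        pl.len().alias("n")
    )
    .sort("label")
)

print(per_label_accuracy)
        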


Predictions using Pipeline

Another application of the trained model is to parse symptoms one at a time and retrieve a "diagnosis" on a case-by-case basis. HF's pipeline() function is very useful for this task.

First, let's bring back the label mapper and create an inverse mapper from it. This will be used to convert the numeric label that the model outputs to its matching text diagnosis:


label_mapper = {
    'psoriasis': 0,
    'arthritis': 1,
    'cervical spondylosis': 2,
    'impetigo': 3,
    'malaria': 4,
    'varicose veins': 5,
    'allergy': 6,
    'bronchial asthma': 7,
    'dengue': 8,
    'chicken pox': 9
}

label_mapper_inverse = {v: k for k, v in label_mapper.items()}

label_mapper_inverse
        

{0: 'psoriasis',
 1: 'arthritis',
 2: 'cervical spondylosis',
 3: 'impetigo',
 4: 'malaria',
 5: 'varicose veins',
 6: 'allergy',
 7: 'bronchial asthma',
 8: 'dengue',
 9: 'chicken pox'}
        

Next, import HF's pipeline() and use it to define a pipe for our purpose. We will also create a diagnose_symptoms() function to drive the process.


from transformers import pipeline

pipe = pipeline(
    task = "text-classification",
    model = tuned_model,
    tokenizer = tokenizer,   # tokenizer defined above
    device = "cuda"          # to run on GPU
)

def diagnose_symptoms(symptom):
    pred = pipe(symptom)
    ## The pipeline returns labels as "LABEL_<i>" by default; extract the
    ## integer id and map it back to its text diagnosis:
    pred_int = int(pred[0]["label"].split("_")[1])
    pred_diagnosis = label_mapper_inverse[pred_int]
    pred_score = np.round(pred[0]["score"], 3)
    return f"Diagnosis: {pred_diagnosis} (score: {pred_score})"
        

Now, we can use the diagnose_symptoms() function to return a "diagnosis" based on symptoms (a classification score is also returned):


diagnose_symptoms("I have a skin rash. It is blotchy.")
        

'Diagnosis: impetigo (score: 0.677)'
        

diagnose_symptoms("My head hurts and I am coughing and sneezing")
        

'Diagnosis: allergy (score: 0.993)'
        

diagnose_symptoms("My neck hurts. I have back pain. I am constantly dizzy.")
        

'Diagnosis: cervical spondylosis (score: 0.996)'
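
One design note before wrapping up: rather than parsing the "LABEL_<i>" strings, the label names can be attached to the model itself. transformers accepts id2label / label2id mappings at load time, and the pipeline then returns readable labels directly. A minimal sketch, reusing the mappers defined above (named_model and named_pipe are illustrative names):


from transformers import AutoModelForSequenceClassification, pipeline

## Attaching the mappings to the model config makes the pipeline
## return e.g. "allergy" instead of "LABEL_6":
named_model = AutoModelForSequenceClassification.from_pretrained(
    "models/google-base-cased-tuned-on-medical-text-top-ten",
    id2label = label_mapper_inverse,
    label2id = label_mapper
)

named_pipe = pipeline("text-classification", model = named_model, tokenizer = tokenizer)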
        

Summary

In this article I used HuggingFace πŸ€— to fine-tune a text classification model to predict medical diagnoses from text descriptions of symptoms. Obviously, the dataset was greatly truncated for the sake of computation time, but the exercise nevertheless showcases the vast potential of text classification methods in various fields, including medicine.

As the scope of LLMs expands, so will their adoption across a wide spectrum of professional applications. The goal, then, should be to maximize the benefits of that adoption through continuous development.

Thanks for reading!