Deidentification

Deidentification Overview

There are many instances in which you would want to de-identify your data before ingesting it into Flywheel. The main reason is if your acquired data from the scanner is not already de-identified (ie, if you don’t scan at INC). Below is a brief summary of several ways to de-identify your data before it enters Flywheel, or once it’s already in Flywheel.

Warning

The UCB instance of Flywheel relies on storage (PetaLibrary) that can have no personal identifying information on it. Therefore, it is crucial to de-identify your data if it was collected somewhere outside of INC.

Deidentification on Edge

De-identification on edge refers to wiping out any identifying information in dicom headers at the source (before those dicoms are ever uploaded to Flywheel). That source can be your institution’s data storage cluster, or your institution’s scanner. In either case, Flywheel will have no record of any personal identifying information (PII) in your dicom headers if de-identification happens on edge.

Fortunately, through the Flywheel command line interface tool (CLI), Flywheel can assist in the de-identification process without “knowing” about your data or registering it in the Flywheel database.

Pre-requisites

To get started, you’ll need to download the Flywheel CLI on the machine where your data resides, or on a machine that has mounted access to where your data resides: visit the CLI Basics page.

Before getting started, you’ll also need some way to read the existing header information in your dicom files (for example, DCMTK’s dcmdump tool). Alternatively, if you already know the header codes and header information, you can forgo looking it up with the dicom header reader.

Once the Flywheel CLI is downloaded (and your personal API key generated), you are ready to start ingesting de-identified data into Flywheel.

Simple De-identification

If the only identifiable information in your dicoms exists in the Patient ID, Patient Name, and Patient Birthdate, Flywheel has a pre-built flag that simply removes those dicom header fields from every dicom file that is uploaded.

Suppose we want to upload a dicom acquisition from our local computer onto Flywheel. We first inspect the header information in that acquisition (just checking one dicom in the acquisition is enough). All of the following commands are run in a Terminal or DOS prompt (aka command line interface):

dcmdump <path_to_single_dicom>

For example:

dcmdump just_t1_dummy_ids/ics/deid/deid02/inc01/anat-T1w_acq-mpr08_run-01/1.3.12.2.1107.5.2.43.67087.2022101911212761306002309.0.0.0.dicom/1.3.12.2.1107.5.2.43.67087.2022101911284077703103114.MR.dcm

The returned output below (a partial dump of the dicom header) makes it clear that there is indeed identifying information:

Dicom Code

Field Name

Value

(0010,0010)

Patient’s Name

John

(0010,0020)

Patient ID

Doe

(0010,0030)

Patient’s Birth Date

19010101

(0010,0040)

Patient’s Sex

F

(0010,1010)

Patient’s Age

121Y

To upload this acquisition but without the Patient’s Name, Patient ID, or Patient’s Birth Date, the Flywheel command is simply to add the --de-identify flag onto the fw ingest dicom command.

fw ingest dicom --de-identify --subject <subject_ID> --session <session_ID> <path_to_dicom_directory> <Flywheel Group> <Flywheel Project>

For example:

fw ingest dicom --de-identify --subject deid01 --session inc01 just_t1_dummy_ids/ics/deid/deid02/inc01/anat-T1w_acq-mpr08_run-01/1.3.12.2.1107.5.2.43.67087.2022101911212761306002309.0.0.0.dicom ics deid

More details about the fw ingest dicom command are given on our CLI Basics page.

If you now check your upload in Flywheel by clicking on the information button next to the dicom, you’ll see that the Patient’s Name, Patient ID, and Patient Birth Date simply don’t exist as metadata.

Note

While the Patient’s Name, Patient ID, and Patient’s Birthdate have been de-identified, if you had any other identifying information in your dicom header, it is now in Flywheel. Not good. For scrubbing other identifying metadata from the dicom header, read on.

De-identification Profile

Creating a de-identification profile allows you to have maximum control over how each and every piece of metadata in your dicom header is handled. Creating a de-identification profile can be a daunting task at first - not because it’s difficult - but because of how many different options you have at every decision. Some of the basic options include: do nothing, increment all age-related numbers by some given amount, delete the data entirely, replace the data with something different, jitter the data if it’s a number, and hash the data.

Flywheel has extensive information about how to create a de-identification profile on their de-id documentation website.

In short, this de-identification profile is nothing more than a YAML file. The example below shows a de-identification profile that does the following:

  • Sets up a place to store the de-id log (a csv file that allows you to re-identify your data, should you need to)

  • Increments each date field by subtracting 17 days

  • Sets the patient’s age to use Year units

  • Calculates the Patient’s Age from the Patient Birth Date dicom tag

  • Removes the PatientID dicom tag

  • Replaces the dicom tag StationName with XXXXX

  • Hashes the AccessionNumber and ConcatenationUID dicom tags

# the de-id updates before uploading
# The option is ignored in ingest, you can use --save-deid-logs PATH to save the log.

deid-log: ./deid_log.csv

# Sets the filetype to DICOM

dicom:

  # Date-increment controls how many days to offset each date field
  # where the increment-date (shown below) is configured.
  #Positive values will result in later dates, negative
  # values will result in earlier dates.

  date-increment: -17

  # patient-age-from-birthdate sets the DICOM header as a 3-digit value with a suffix
  # be 091D, and that same age in months would be 003M. By default, if
  # the age fits in days, then days will be used,
  # otherwise if it fits in months, then months
  # will be used, otherwise years will be used

  patient-age-from-birthdate: true

  # Set patient age units as Years. Other options include months (M) and days (D)

  patient-age-units: Y

  # The following are field transformations.
  # Remove, replace-with, increment-date, hash, and hashuid can be used with any DICOM
  # field. Replace name with the DICOM field "keyword" by the DICOM standard
  fields:

    # Use remove Remove a dicom field Removes the field from the DICOM entirely.
    # If removal is not supported then this field will be blank.
    # This example removes PatientID.

    - name: PatientID
      remove: true

    # Replace a dicom field with the value provided.
    # This example replaces “StationName” with "XXXX" in Flywheel

    - name: StationName
      replace-with: XXXX

    # Offsets the date by the number of days defined in
    # the date-increment setting above, preserving the time
    # and timezone. In this example, StudyDate appears as 17 days earlier

    - name: StudyDate
      increment-date: true

    # One-Way hash a dicom field to a unique string

    - name: AccessionNumber
      hash: true

     # Replaces a UID field with a hashed version of that
     # field. The first four nodes (prefix) and last node
     # (suffix) will be preserved, with the middle being
     # replaced by the hashed value

    - name: ConcatenationUID
      hashuid: true

Testing the De-identification Profile

Once you’ve created your de-identification profile, Flywheel also has a command line interface tool that allows you to test your profile on sample data before using it more broadly for the final dicom upload.

Extensive documentation on testing your de-id profile exists on the Flywheel site as well as a brief summary below.

In the previous section, we created a de-identification profile that we called deid_profile.yaml. Suppose we now want to test how this profile transforms one of our example dicom directories, and store the results of this transformation in a directory we call deid_test_dir. Below is the Flywheel command line call that performs the aforementioned steps:

fw deid test <path_to_dicom_directory_to_deid> <path_to_deid_yaml_profile> <path_to_directory_for_test_results> --session <session_ID> --subject <subject_ID>

For example:

fw deid test just_t1_dummy_ids/1.3.12.2.1107.5.2.43.67087.2022101911212761306002309.0.0.0.dicom deid_profile.yaml deid_test_dir --session inc01 --subject deid02

The result of this call from the terminal creates a csv file called deid_log.csv in the directory deid_test_dir. The CSV file shows a before and after (what each dicom header field was before the transformation, and what it became after the transformation). When you first build a de-id profile, it’ll be an iterative process of testing the profile to make sure you have captured all the desired transformations and haven’t left any identifying information in the dicom header.

Uploading the De-ID Profile To You Flywheel Project

Once you have created and tested your de-identification profile, you can ask the INC team to upload your profile to the relevant Flywheel Project. Once the profile exists as an attachment in your Project settings, any upload you perform (via the GUI, SDK, or CLI) for that Project will first be de-identified based on the rules you laid out in your profile.

Alternatively, if you’d rather keep your de-id profile secret, want to apply different de-id profiles for different subjects, etc, it’s best to continue to the next section which describes how to upload data while the de-id profile remains local to your personal computer (not in Flywheel).

The Ingest Config Template

Now that you’ve put in the hard work into making the perfect de-id profile, you’d like to use it for an actual data upload. However, if you opted not to upload the de-id profile to Flywheel, there’s one more step: the Ingest Config Template.

The ingest config template is a broad topic in and of itself (best described in the Flywheel template documentation).

Briefly, the ingest template is a configuration YAML file that allows you to control every part of the Flywheel upload process. For example, the ingest template defines the relationship between your local folder structure (where the data exists) and how you want that data to be labelled and mapped onto the Flywheel data hierarchy. Critically, the ingest config template also defines the de-identification profile(s) for the given Project.

The ingest template is its own topic and won’t be covered in this section; however, to apply the de-id profile you created and tested, you simply need to paste it into the ingest config template. Below is an example of an ingest config template titled config.yaml. Notice the copied and pasted de-identification profile we worked on in previous sections.

####
# Template and Group/Project Settings
#####

template:
  - pattern: "{group}"
  - pattern: "{project}"
  - pattern: "{subject}"
  - pattern: "{session}"
  - pattern: "{acquisition}"
    packfile_type: dicom

#####
# Optional includes/excludes for directories and files
#####

# Patterns of directories to include
# include-dirs:
# - "*.dicom"

# Patterns of filenames to exclude
# exclude:
# - "*.txt"
# - "*.xml"

#####
# De-identification Settings
#####
deid-profiles:
- name: Anschutz

  # Indicates where you want to place the de-id log. You will use this log file to preview
  # the de-id updates before uploading
  # The option is ignored in ingest, you can use --save-deid-logs PATH to save the log.

  deid-log: ./deid_log.csv

  # Sets the filetype to DICOM

  dicom:

    # Date-increment controls how many days to offset each date field
    # where the increment-date (shown below) is configured.
    #Positive values will result in later dates, negative
    # values will result in earlier dates.

    date-increment: -17

    # patient-age-from-birthdate sets the DICOM header as a 3-digit value with a suffix
    # be 091D, and that same age in months would be 003M. By default, if
    # the age fits in days, then days will be used,
    # otherwise if it fits in months, then months
    # will be used, otherwise years will be used

    patient-age-from-birthdate: true

    # Set patient age units as Years. Other options include months (M) and days (D)

    patient-age-units: Y

    # The following are field transformations.
    # Remove, replace-with, increment-date, hash, and hashuid can be used with any DICOM
    # field. Replace name with the DICOM field "keyword" by the DICOM standard
    fields:
      # Use remove Remove a dicom field Removes the field from the DICOM entirely.
      # If removal is not supported then this field will be blank.
      # This example removes PatientID.

      - name: PatientID
        remove: true

      # Replace a dicom field with the value provided.
      # This example replaces “StationName” with "XXXX" in Flywheel

      - name: StationName
        replace-with: XXXX

      # Offsets the date by the number of days defined in
      # the date-increment setting above, preserving the time
      # and timezone. In this example, StudyDate appears as 17 days earlier

      - name: StudyDate
        increment-date: true

      # One-Way hash a dicom field to a unique string

      - name: AccessionNumber
        hash: true

       # Replaces a UID field with a hashed version of that
       # field. The first four nodes (prefix) and last node
       # (suffix) will be preserved, with the middle being
       # replaced by the hashed value

      - name: ConcatenationUID
        hashuid: true

Putting it All Together

The last step once the de-id profile and the template config YAML are ready, is to make the actual call to Flywheel to upload your dicoms. This is done either with a call to fw ingest dicom or with fw ingest template.

To use fw ingest dicom to upload the example data to Flywheel Group ics, Flywheel Project deid, Flywheel Subject deid03, and Flywheel Session inc01, using our created config file config.yaml which includes the de-ide profile named Anschutz, we use the following command line call:

fw ingest dicom --config-file config.yaml --de-identify --deid-profile Anschutz --subject deid03 --session inc01 ./just_t1_dummy_ids/ics/deid/deid02/inc01/anat-T1w_acq-mpr08_run-01/1.3.12.2.1107.5.2.43.67087.2022101911212761306002309.0.0.0.dicom ics deid

More options are available with fw ingest template, but to accomplish the same upload as above, the command line argument is:

fw ingest template -C config.yaml ./just_t1_dummy_ids --group ics --project deid --de-identify --deid-profile Anschutz

Deidentification From the Scanner

It is also possible to create a profile that applies to data coming directly from the scanner. If this is of interest, please contact Lena or Amy.

Deidentification Gear in Flywheel

Lastly, there’s an additional option to de-identify the data once it’s already in Flywheel by running the de-id gear. However, Flywheel has version control on files (including DICOM files), so a copy of your “identifiable” data before the de-id gear was run will exist somewhere in Flywheel (even if it is not accessible to all users). Since the CUB instance of Flywheel can’t have any identifiable information at any time, running the de-id gear is not an option we advertise on our site.