We are having problems reading the contents of the “Readme.txt” file. What does it say?

Here is all of the information in that file:

  • 2023_Competition_Training.csv       = Data to be used for analysis & model development
  • 2023_Competition_Holdout.csv       = Holdout data to be scored with final model and results returned for mid-cycle leaderboard and/or Oct 16 submission
  • Humana_Mays_2023_DataDictionary.xls       = File Statistics, File Layout, descriptions of attributes for each event type

When will the dataset be available?

Competition data will be distributed to registered and verified teams starting after the September 13, 2023 Informational Call and ending after the registration deadline of September 22, 2023. Typically, data will be available no more than 48 hours after registration and verification.

What format will the dataset be in?

The dataset will be available in a CSV file, along with a data dictionary.

Are we allowed to use publicly available data to help us in this case competition?

Yes. Students are encouraged to use open-source data when creating a solution.

Explain why there are instances in the claims data where the process date is less than the service date. 

You are correct that the process date should not occur prior to the service/visit date. One of two things may be happening: (a) there is a problem with your data (e.g. a read-in or join error), or (b) the data itself is erroneous. Data is messy. You must decide which it is and how to handle it.
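As a starting point for deciding between (a) and (b), it can help to count how many rows are affected. The sketch below is only an illustration: the column names serv_date and process_date are hypothetical placeholders (check the data dictionary for the real ones), and the rows are toy data, not competition data.

```python
from datetime import date


def find_date_anomalies(claim_lines, service_col="serv_date", process_col="process_date"):
    """Return the rows whose process date is earlier than the service date."""
    return [row for row in claim_lines if row[process_col] < row[service_col]]


# Toy rows; serv_date and process_date are hypothetical column names.
rows = [
    {"medclm_key": 1, "serv_date": date(2023, 1, 10), "process_date": date(2023, 1, 20)},
    {"medclm_key": 2, "serv_date": date(2023, 2, 5), "process_date": date(2023, 1, 30)},
]

suspect = find_date_anomalies(rows)
print(f"{len(suspect)} of {len(rows)} rows have a process date before the service date")
```

If the share of flagged rows changes after you re-read or re-join the files, a read-in or join error is the more likely cause; if it stays the same, the source data itself is probably wrong for those rows.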

What is the distinction between “medclm_key” and “clm_unique_key” in the context of the “medclms_trian” dataset, and why does “medclm_key” seem to have a unique value for every row while “clm_unique_key” has duplicate values?

You should think of medclm_key as the primary key for the medical claims table; it is unique for every claim line. The clm_unique_key groups together a single “claim”, which can consist of multiple “claim lines”. These unique claims can in turn be combined to form a logical claim that groups together claims from the same provider/member combo with overlapping service dates. We typically use the logical claim to count utilization/visits rather than clm_unique_key.
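One possible way to approximate logical claims is to merge overlapping service-date ranges within each member/provider pair. This is a minimal sketch, not Humana's actual rule: the column names member_id, provider_id, serv_from and serv_to are hypothetical, and treating ranges that share a day as overlapping is an assumption.

```python
from collections import defaultdict
from datetime import date


def count_logical_claims(claim_lines):
    """Count logical claims per (member, provider) by merging overlapping date ranges."""
    spans_by_pair = defaultdict(list)
    for line in claim_lines:
        pair = (line["member_id"], line["provider_id"])
        spans_by_pair[pair].append((line["serv_from"], line["serv_to"]))

    counts = {}
    for pair, spans in spans_by_pair.items():
        spans.sort()  # order by start date, then end date
        n_logical = 0
        current_end = None
        for start, end in spans:
            if current_end is None or start > current_end:
                # No overlap with the running range: a new logical claim begins.
                n_logical += 1
                current_end = end
            else:
                # Overlap: extend the running range if this line ends later.
                current_end = max(current_end, end)
        counts[pair] = n_logical
    return counts


# Toy claim lines; every column name except medclm_key is hypothetical.
lines = [
    {"medclm_key": 1, "member_id": "A", "provider_id": "P",
     "serv_from": date(2023, 1, 1), "serv_to": date(2023, 1, 3)},
    {"medclm_key": 2, "member_id": "A", "provider_id": "P",
     "serv_from": date(2023, 1, 2), "serv_to": date(2023, 1, 5)},
    {"medclm_key": 3, "member_id": "A", "provider_id": "P",
     "serv_from": date(2023, 2, 1), "serv_to": date(2023, 2, 1)},
    {"medclm_key": 4, "member_id": "A", "provider_id": "Q",
     "serv_from": date(2023, 1, 2), "serv_to": date(2023, 1, 2)},
]

logical_counts = count_logical_claims(lines)
print(logical_counts)
```

In the toy data, the first two lines for provider P overlap and collapse into one logical claim, so four claim lines become three visits in total.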

Are we allowed to utilize a Private Github repository to share access to data between team members?

Yes. However, make sure that the data is not public, as that would result in a violation of the NDA.

Are the two claim datasets claims filed by healthcare providers or by patients?

The way this usually works for medical claims is this: Someone goes to the doctor, they show their insurance card to the doctor, and after the appointment, a billing person submits a claim to the insurance company.

For prescription claims, it’s a lot faster but generally the same process. When a patient fills a prescription at a pharmacy, the pharmacy submits the claim to the insurance company.

Since Humana is the insurance company, this is the data we have. Generally, it comes from the provider or pharmacy directly to Humana.

How much on average does the insurance cover for patients in both medical claims and prescription drugs? Do participants have access to that data? 

Coverage amounts vary based on a myriad of factors including, but not limited to, the plan an individual member has chosen. However, for the purpose of this case competition, those details are not included in the data that was provided to the participants.


Why might one prioritize AUC over recall in this context? To elaborate, when we focus on maximizing recall, there is a slight decrease in AUC, but we end up capturing a larger portion of the positive class. Despite the model’s lower precision, the subsequent false positives we investigate could still be advantageous, even if we acknowledge that they might not influence the treatment decision.

It’s not that we believe AUC is more important; it is simply the measure of accuracy chosen for Round 1 evaluation, where every team’s model will be evaluated using the same measuring stick. However, feel free, in subsequent rounds, to make a case for a model that maximizes recall, why that makes sense, and what (if any) implications it carries.
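To make the difference between the two metrics concrete, here is a small sketch on toy labels and scores (not competition data). The helper names roc_auc and recall_at are ours, and the AUC is computed with the standard pairwise-ranking definition rather than any particular library. AUC depends only on how the scores rank positives against negatives, while recall depends on the threshold you pick.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at(y_true, scores, threshold):
    """Recall when every score at or above the threshold is predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    return tp / (tp + fn)


# Toy labels and model scores, for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
recall_strict = recall_at(y_true, scores, threshold=0.5)
recall_loose = recall_at(y_true, scores, threshold=0.3)
print(f"AUC={auc:.2f}, recall@0.5={recall_strict:.2f}, recall@0.3={recall_loose:.2f}")
```

Lowering the threshold raises recall for the same set of scores without changing the AUC, which is one reason a threshold-free measure is convenient for comparing every team's model on the same footing.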


Is there more information about Tagrisso and how it is administered that you can provide?

More information on Tagrisso can be found in the following fact sheet: Tagrisso PI 6.2023

