Submitted Q&A

Weekly Q&A will be updated every Tuesday and Friday by 9:00 pm CST.

Competition Related

What needs to be submitted by the October 11, 2020 deadline?

Two items will need to be submitted by the October 11th submission deadline:

  • The first is the scored CSV file containing your modeling algorithm applied to the provided HOLDOUT file. The CSV file will be used in Round 1 to determine the top 50 submissions moving on to Round 2.
  • The second item that needs to be submitted is your written report documenting your efforts, findings, and recommendations. This written report will be used in Round 2 to determine the top 5 submissions moving on to the Finals.

Regarding Round 1 CSV submission, what does RANK refer to?

RANK refers to the observation number after all the records in the HOLDOUT file have been ordered from highest score (most likely to have transportation need) to lowest score (least likely to have transportation need).  Therefore, the highest scoring record will have a RANK = 1, the 2nd highest scoring record will have RANK = 2, etc.
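As a minimal sketch of the ordering described above, the following Python assigns RANK from a list of model scores. The function name `assign_rank` and the tie-breaking rule (earlier records come first) are assumptions for illustration, not part of the competition rules.

```python
def assign_rank(scores):
    """Return the RANK of each record: 1 for the highest score, 2 for the next, etc.

    Ties are broken by original record order (an assumption; the rules
    do not state how ties should be handled).
    """
    # Record positions sorted from highest score to lowest (sorted() is stable).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, record_index in enumerate(order, start=1):
        ranks[record_index] = position
    return ranks


# Hypothetical scores for three HOLDOUT records.
example_ranks = assign_rank([0.2, 0.9, 0.5])
```

Here the second record has the highest score, so it receives RANK = 1.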

Will the result of the Individual RANK be graded?

Including the RANK of each record helps ensure the file is ordered appropriately, which is needed to create the ROC curve and the subsequent AUC metric. The AUC metric associated with your scored HOLDOUT file will be the measure used to identify the top 50 submissions moving on to Round 2.
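To make the AUC metric concrete, here is a small sketch that computes it directly from scores and known outcomes, using the standard interpretation of AUC as the share of (positive, negative) record pairs in which the positive record has the higher score. The function name `auc_from_scores` is hypothetical, and this is not necessarily the exact procedure the organizers use.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of positive/negative pairs ordered correctly.

    labels: 1 = has the outcome, 0 = does not.
    A tied pair counts as half a correctly ordered pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative record")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical outcomes and scores for four records.
example_auc = auc_from_scores([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

Because only the ordering of the scores matters, any score that ranks records the same way yields the same AUC.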

Can we use any program or language that we would like in organizing the data? In other words, are there any restrictions as to how we do our analysis?

There is no restriction on how you do your analysis. The rules state, however, “Students may use any of the following analytical tools: SAS, SPSS, R, Python, Matlab, and Microsoft Excel”

Is this holdout provided what we are judged on and what we provide scores and rankings for?

Correct. The holdout file will be used in round 1 to determine the Top 50 entries based on AUC.

Is the PowerPoint due along with our model and write-up at the time of submission on October 11th?

No, the PowerPoint is not due on Oct 11th. The PowerPoint will be required for the final presentations on 11/12 and is only expected from the Top 5 finalists.

What, if any, analysis are we expected to run on the holdout data for our submission?

The only thing you need to do with the holdout file is to score it with your ‘final’ model. Once scored, you will provide the holdout file, along with the write-up, as your completed submission on 10/11. We will use your scored file to evaluate the performance of your model using ROC/AUC metrics. This will determine whether your team makes the top 50 submissions and moves on to Round 2 of judging.

Our team members are wondering how much AUC shall we reach to be the top 50. Do we have a place to see the current scores like Kaggle? It’s very important for us to decide on our direction of workings.

We can’t tell you what the cut-off will be for AUC top 50 given it will be relative to all received submissions. However, there will not be a ‘leader board’ of AUCs to compare your model…just like in the real-world. Build and submit the model you think has good accuracy AND provides insights that can be used as a foundation for your recommendations.

Transportation_issue isn’t listed in the holdout data. Do you have data separately with you so that you can calculate accuracy with predicted values of Teams participating?

We have the transportation responses for the holdout file which will be used when scoring.

Can this dataset be used in some other research?

NO.  Refer to the NDA agreement for the competition.



Attribute/Data Specific

For the columns that start with “credit_”, I’m just not sure what the values represent. For example, “credit_num_1stmtg_30to59dpd” is labeled as “Number 1st Mortgage Accts – 30 to 59 dpd”. I’d expect it to be a whole number because it seems to count the number of accounts past due, but the max value in this column is 0.0333.

These features were originally at the zip+4 level, but for this project they had to be rolled up to the zip level, so the resulting feature value is no longer necessarily a whole number.
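The answer does not say how the roll-up was done. Purely as an illustration, and assuming the zip-level value is a simple average of the zip+4 counts, the sketch below shows how whole-number counts can turn into a fraction such as 0.0333. The function name and the example data are hypothetical.

```python
def roll_up_to_zip(zip4_counts):
    """Average zip+4 level counts into one zip-level value.

    Averaging is an assumed roll-up rule; the actual method is not documented.
    """
    return sum(zip4_counts) / len(zip4_counts)


# Hypothetical zip code with 30 zip+4 areas, one of which has a single account.
example_value = roll_up_to_zip([1] + [0] * 29)
```

Under this assumption, one past-due account spread across 30 zip+4 areas gives 1/30, or about 0.0333.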

For the column labeled ‘cons_hcaccprf_p’, the values are supposed to be [P, C, H, O] but in the Training set they are all 1s and 0s.

Here are the updated descriptions for these features:

  • CONS_HCACCPRF_P – would be a 1 if this member is likely to have a preference to go to a personal doctor or personal care physician for their health care needs
  • CONS_HCACCPRF_H – would be a 1 if this member is likely to have a preference to go to a hospital or standalone emergency room or urgent care center for their health care needs

A couple of codes’ definitions were not defined in the data documentation excel sheet. For example, the ‘con%’ (only a couple are defined in the KBM tab, what about the rest?), ‘bh%’, ‘prov%’, & ‘rev%’ columns. Where does the reference table sit? I don’t seem to find any reference table in the downloaded zip file. I could only find the MCC reference table, but not for the behavioral health’s, PDC, and REV variables.

Here are the updated descriptions for the BH features. PROV, PDC, and REV features should be determinable using information in the feature name, the feature description, and the “Acronyms_Terms” tab:

bh_adtp_ind – binary indicator for each of the BH Categories – Post-Traumatic Stress Disorder
bh_aoth_ind – binary indicator for each of the BH Categories – Other Anxiety Disorder
bh_bipr_ind – binary indicator for each of the BH Categories – Bipolar Disorder
bh_cdal_ind – binary indicator for each of the BH Categories – Alcohol Abuse
bh_cdsb_ind – binary indicator for each of the BH Categories – Substance Abuse
bh_cdto_ind – binary indicator for each of the BH Categories – Tobacco Use Disorder
bh_dema_ind – binary indicator for each of the BH Categories – Major Depressive Disorder

How is the cons_n2pmv (% Motor Vehicle ownership) variable measured? Do the values (0-99) represent percentiles?

The variable was derived from census data which was collected at census tract level. The values represent percentages rather than percentiles. It is % of households in the Census Tract or Census Block-Group with the attribute.

Could you please provide the meaning of the following acronyms/shorthands?

  • Behavioral health features (starts with “bh_”): adtp, aoth, cdal, cdsb, cdto.
    • Refer to the updated variable descriptions.
  • Physician E&M features (starts with “phy_em_”): pe, pi, px
    • PE = Emergency, PI = Inpatient, PX = Outpatient/Office
  • hcc (feature name is hcc_weighted_sum)
    • CMS Hierarchical Condition Category. HCC_WEIGHTED_SUM represents the sum of weighted condition categories. The weighting is based on the severity of the condition.

What does ‘other’ in zipcode, county and state mean?

The “Other” in zipcode is an artifact of the synthetic data relating to sparse zip codes.

The dataset shows “*”. What does that mean?

The “*” is an artifact of the synthetic data relating to sparse categorical values.

PDC calculates the ratio of number of days the patient is covered by the medication in a period to the total number of days in the period. Why is the max value PDC value for the variable ‘pdc_ast’ greater than 1 (max = 1.1)? Shouldn’t the max value be capped to 1. What does a value of 1.1 represent?

1.1 is an imputed value used when PDC was missing.

How do the variables “betos_*_ct” have fractional values?

The values are fractional because the total count was divided by the number of months that the member stayed with Humana during the 12 months prior to the survey date.

What date was the survey administered?

Most surveys were completed in Nov/Dec 2019.

Is “transportation challenge” a self-reported value?


What are the revenue codes of the CMS categories (Rows 270-276)? Is there a reference table for those? Additionally, where is the CMS Level 2 diagnosis reference table for those categories?

Revenue codes of the CMS categories can be found in the description tab.

cmsd2_gus_m_genital_ind – DISEASES OF MALE GENITAL ORGANS
cmsd2_men_mad_ind – MOOD [AFFECTIVE] DISORDERS
cmsd2_mus_polyarthropath_ind – INFLAMMATORY POLYARTHROPATHIES
cmsd2_mus_spondylopath_ind – SPONDYLOPATHIES
cmsd2_sns_general_ind – GENERAL SYMPTOMS AND SIGNS

How are pmpm- per member per month calculated? Can we know the total cost for providing the coverage and member months?

Utilization per member per month is calculated as total utilization divided by the number of months that the member stayed with Humana during the 12 months prior to the survey date. Since the member months information was not included in the dataset, total cost can’t be calculated.
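A short sketch of the calculation, with a hypothetical function name and made-up numbers, also shows why totals cannot be recovered from the PMPM value alone: different combinations of total utilization and member months produce the same result.

```python
def per_member_per_month(total_utilization, member_months):
    """Total utilization divided by months enrolled in the prior 12 months."""
    if not 1 <= member_months <= 12:
        raise ValueError("member_months must be between 1 and 12")
    return total_utilization / member_months


# Two hypothetical members with different totals and enrollment lengths.
pmpm_a = per_member_per_month(5, 8)      # 5 units over 8 months
pmpm_b = per_member_per_month(7.5, 12)   # 7.5 units over 12 months
```

Both members end up with the same PMPM value, so the total cannot be inferred without knowing the member months.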

We have en, spa, e. What does ‘e’ mean?

Here are the only three categories that you would see from the dataset. Any other categories can be treated as miscoding from the source data.

SPA – Spanish
ENG – English
OTH – Other
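One possible way to handle the miscoded categories during data preparation is sketched below. Mapping unexpected codes to a missing value, and normalizing case and whitespace before checking, are assumptions; the function name is hypothetical.

```python
VALID_LANGUAGE_CODES = {"SPA", "ENG", "OTH"}


def clean_language_code(raw):
    """Return a valid language code, or None if the value looks miscoded."""
    if raw is None:
        return None
    code = raw.strip().upper()  # assumed normalization of case and whitespace
    return code if code in VALID_LANGUAGE_CODES else None


# Hypothetical raw values, including a miscoded 'E'.
cleaned = [clean_language_code(v) for v in ["ENG", " spa ", "E", None]]
```

Whether miscoded values are better treated as missing or folded into OTH is a modeling choice left to each team.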

For the column ‘src_platform_cd’, what do  ‘EM’  and ‘LV’  mean?

The Source Platform Code identifies the platform ‘owner’, that is, the platform that processes the claims and where the data is administered from. ‘EM’ – Metavance; ‘LV’ – Louisville.

How are the demographic percentage variables like cons_n2pmv (Census % Motor Vehicle Ownership) calculated? Are they calculated the percent in an area or just inside a household?

It is % of households in the Census Tract or Census Block-Group with the attribute.

What does dpd stand for with all predictors related to ‘credit’?

Days Past Due

‘hcc_weighted_sum’ represents Sum of weighted existing HCC categories present in the reference table for each member.  What is the significance of HCC risk score, and does the order of increasing risk score matter?

HCC Risk Score would correlate with the health of the member.

The Utilization has two sets of data: total and Non-BH broken by utilization category. Can we please know the relation between these.(how they derived etc)?

Total utilization includes any type of utilization (Behavioral Health related or Non Behavioral Health related), while Non-BH indicates only Non Behavioral Health related utilization. Because of the synthetic data process, Non-BH and BH may not exactly sum to the Total.
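If a BH-only value is wanted, one option is to derive it as Total minus Non-BH. The sketch below is an assumption about how a team might do this, not a documented derivation; it floors the result at zero because the synthetic data may make Non-BH slightly exceed the Total. The function name is hypothetical.

```python
def estimate_bh_utilization(total, non_bh):
    """Estimate BH utilization as total minus Non-BH, floored at zero."""
    return max(total - non_bh, 0.0)


# Hypothetical values: a consistent pair and an inconsistent (synthetic) pair.
bh_consistent = estimate_bh_utilization(2.5, 1.0)
bh_inconsistent = estimate_bh_utilization(1.0, 1.25)
```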

Does ‘NA’ under hedis_ami column mean Null (missing value)?

‘Y’: compliant
‘N’: not compliant
‘NA’: not applicable to the member
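Since ‘NA’ here means "not applicable" rather than a missing value, one way to keep that distinction when modeling is to encode each HEDIS value as two flags. This is a sketch under that assumption; the function name and the two-flag encoding are illustrative choices.

```python
def encode_hedis(value):
    """Map a HEDIS value to (applicable, compliant) flags.

    'Y'  -> measure applies and the member is compliant
    'N'  -> measure applies and the member is not compliant
    'NA' -> measure does not apply to the member
    Any other value raises KeyError so unexpected codes are noticed.
    """
    mapping = {"Y": (1, 1), "N": (1, 0), "NA": (0, 0)}
    return mapping[value]


# Hypothetical hedis_ami values for three members.
encoded = [encode_hedis(v) for v in ["Y", "N", "NA"]]
```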

What’s the relationship between the rx features on the same level? Is there overlap? Or are those features totally different? i.e. branded VS generic rx features

It depends. Branded and generic features do not overlap, while mailed, maint, and otc features might overlap.

Imputation of missing values:

  • We had an assumption that missing values were missing at random
  • Almost all features derived from claims (both numeric and binary indicators) were automatically imputed with 0 if we didn’t get any claims from that member during the model lookback period. This is a reasonable imputation, since a member without a claim related to a certain condition usually does not have that condition.
  • PDC features: 1.1 is the default value when a member is either ineligible for medication adherence or doesn’t have at least 2 prescriptions (the minimum required to make the calculation) in a particular category. This is the only set of features for which we used a unique numeric value other than 0 for imputation, since we wanted to differentiate Not Applicable members from members whose PDC is close to 0.
  • HEDIS features: Members need to meet eligibility criteria to be considered compliant or not compliant with a certain HEDIS measure. Members who are either ineligible or don’t have claims data have a value of “NA”.
  • KBM features: we didn’t impute KBM features (“cons_xxx”). If you see an “UNK” value, it comes directly from the source data rather than from any imputation made by us.
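Because 1.1 is a sentinel rather than a real PDC ratio, teams may want to separate it from genuine values before modeling. The following is a minimal sketch of one way to do so; the function name, the tolerance, and the choice to return a separate not-applicable flag are assumptions.

```python
PDC_NOT_APPLICABLE = 1.1  # default value described in the imputation notes


def split_pdc(value, tol=1e-9):
    """Return (pdc_value, not_applicable_flag) for one PDC feature value.

    The sentinel 1.1 becomes (None, 1); real ratios pass through with flag 0.
    """
    if abs(value - PDC_NOT_APPLICABLE) < tol:
        return None, 1
    return value, 0


# Hypothetical pdc_ast values: not applicable, zero adherence, high adherence.
examples = [split_pdc(v) for v in [1.1, 0.0, 0.85]]
```

Keeping the flag as its own feature preserves the difference between a Not Applicable member and a member whose PDC is close to 0.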

How did you all rate the mabh_segm column as H1, H2, till C8. What was the criteria used to rate someone under H1, H2 etc? Is mabh_segm a derived value based on many other column present in the dataset?

The Medicare Segmentation was an analytic segmentation created several years ago. It leverages many variables about our Medicare members in order to place them into a segment. Some of the variables used in the segmentation are included here, but not necessarily all, as the contest variable list was limited to approximately 800 features.