How to replace encoded missing values with actual missing values in RapidMiner?

I am currently working on COVID data analysis using the COVID-19 DATASET, and I am using RapidMiner for this project. In this dataset, missing values are encoded as 97, 98 or 99 in all other columns, and the death_year column encodes missing values as 9999-99-99.

I am trying to replace the missing values manually by chaining Replace operators, but the system doesn't show all of the operators.

[Screenshot: RapidMiner system design]

I am tasked with doing some EDA and ML on this dataset, but for that I first need to remove or otherwise deal with the missing values.

Answer by Andrew Chisholm

You can use the Declare Missing Value operator. It works for both nominal and numeric attributes: you state which value should be treated as missing, and every occurrence of it in the selected attributes of the example set becomes a missing value.

Here's an example XML process (copy the XML below and paste it into an empty RapidMiner XML pane):

<?xml version="1.0" encoding="UTF-8"?><process version="10.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="10.1.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="utility:create_exampleset" compatibility="10.1.001" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
        <parameter key="generator_type" value="comma separated text"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="input_csv_text" value="aa,bb,cc,dd&#10;99,3.1,23,2023-09-12&#10;98,97,1.1,2023-09-13&#10;3.1,2,1,9999-99-99"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="declare_missing_value" compatibility="10.1.001" expanded="true" height="82" name="Declare Missing Value (4)" width="90" x="246" y="136">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value="dd"/>
        <parameter key="attributes" value="|a4"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="mode" value="nominal"/>
        <parameter key="numeric_value" value="99.0"/>
        <parameter key="nominal_value" value="9999-99-99"/>
        <parameter key="expression_value" value="a4==&quot;bbbb&quot;"/>
      </operator>
      <operator activated="true" class="declare_missing_value" compatibility="10.1.001" expanded="true" height="82" name="Declare Missing Value (3)" width="90" x="380" y="34">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="numeric_condition" value="&gt; 97"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="mode" value="numeric"/>
        <parameter key="numeric_value" value="97.0"/>
        <parameter key="expression_value" value="&gt; 97"/>
      </operator>
      <operator activated="true" class="declare_missing_value" compatibility="10.1.001" expanded="true" height="82" name="Declare Missing Value" width="90" x="514" y="34">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="numeric_condition" value="&gt;97"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="mode" value="numeric"/>
        <parameter key="numeric_value" value="98.0"/>
        <parameter key="expression_value" value="a1==99"/>
      </operator>
      <operator activated="true" class="declare_missing_value" compatibility="10.1.001" expanded="true" height="82" name="Declare Missing Value (2)" width="90" x="648" y="34">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="numeric_condition" value="&gt;97"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="mode" value="numeric"/>
        <parameter key="numeric_value" value="99.0"/>
        <parameter key="expression_value" value="a1==99"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Declare Missing Value (4)" to_port="example set input"/>
      <connect from_op="Declare Missing Value (4)" from_port="example set output" to_op="Declare Missing Value (3)" to_port="example set input"/>
      <connect from_op="Declare Missing Value (4)" from_port="original" to_port="result 2"/>
      <connect from_op="Declare Missing Value (3)" from_port="example set output" to_op="Declare Missing Value" to_port="example set input"/>
      <connect from_op="Declare Missing Value" from_port="example set output" to_op="Declare Missing Value (2)" to_port="example set input"/>
      <connect from_op="Declare Missing Value (2)" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

The process should look like this.

[Screenshot: Process]

The process chains four instances of the Declare Missing Value operator. The first runs in nominal mode and declares 9999-99-99 as missing in all attributes across all examples. The remaining three run in numeric mode and do the same for the numeric values 97, 98 and 99 respectively.

The output should look like this before the replacements:

[Screenshot: Before]

and like this after:

[Screenshot: After]
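
As a cross-check of the expected result outside RapidMiner, here is a minimal sketch in plain Python. It is not part of the original answer and not how RapidMiner implements the operator. It only mimics the intended effect of the four operators on the same sample data. The names `declare_missing` and `clean` are made up for illustration, and the sketch tests each cell individually rather than by attribute type, which gives the same result for this sample.

```python
import csv
import io

# Same sample data as the Create ExampleSet operator in the process.
SAMPLE_CSV = (
    "aa,bb,cc,dd\n"
    "99,3.1,23,2023-09-12\n"
    "98,97,1.1,2023-09-13\n"
    "3.1,2,1,9999-99-99"
)

# Codes that stand for "missing", taken from the question.
NUMERIC_CODES = {97.0, 98.0, 99.0}
NOMINAL_CODES = {"9999-99-99"}


def declare_missing(cell):
    """Return None if the cell holds a missing-value code, else the cell unchanged."""
    if cell in NOMINAL_CODES:
        return None
    try:
        if float(cell) in NUMERIC_CODES:
            return None
    except ValueError:
        pass  # not a number, so the numeric codes cannot apply
    return cell


def clean(csv_text):
    """Apply declare_missing to every cell of every row (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {name: declare_missing(value) for name, value in row.items()}
        for row in reader
    ]


rows = clean(SAMPLE_CSV)
for row in rows:
    print(row)
```

Here `None` plays the role of the missing value that RapidMiner displays as `?`. In the result, aa is missing in the first two rows, bb in the second row, and dd in the third; every other cell is unchanged.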