Suppose the data is -
"King Khaled Hospital """"NG"""""
length (including quotes and spaces) = 33
In above example:
- Keep the starting and ending quotes intact means position 1 & position 33
- before NG, there are 4 double quotes. So make them only 1.
- After NG, there are 5 double quotes at the position of 29, 30, 31, 32, 33. 33th double quotes should be remain as it is beacause it is ending one. 29, 30, 31, 32 should convert into 1 double quotes.
Final output:
"King Khaled Hospital "NG""
length = 27
It is just an exaple. Data can be anything.
Let me give one post example:
My df:
>>>
>>> df.select('full_nm').where(df['eid'] == '60806998').show(truncate=False)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|full_nm |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""King Khaled Hospital """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""NG"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Final output:
"King Khaled Hospital "NG""
If I am using below code:
>>> df_temp = df.withColumn('full_nm', regexp_replace('full_nm', r'("{2,})([^"]+)\1', r'"$2"'))
>>> df_temp.select('full_nm').where(df_temp['eid'] == '60806998').show(truncate=False)
23/11/13 08:58:51 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|full_nm |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|"King Khaled Hospital ""NG"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Setup
Let us use regexp_replace to substitute the two or more consecutive occurrences of quotes with a single quote.Just keep in mind that here I am assuming that the strings in your dataset is always sorrounded by same number of quotes for instance if a word is preceded by 4 quotes then it must be followed by 4 quotes
Result