How to split the word "ActionAction-AdventureShooterStealth" into a list of separate words?


Question: The genres column contains the genres that are present in the games. It has all the genres written together without any space or special characters. Whatever is the major genre of the game is given first, followed by the other genres. For better understanding refer to the table below.

Game    Genres
A       ActionComedyAdventure
B       AdventureComedy
C       NarrationShooting

In the above table, the major genres for the Games A, B and C are Action, Adventure and Narration respectively.

Your job is to extract the major genre for each game and store it in a new column and name the column as “Major Genre”. (Hint: All the genre name starts with uppercase).

I want to split the word "ActionAction-AdventureShooterStealth" into a list of words in the below format

['Action', 'Action-Adventure', 'Shooter', 'Stealth']

I tried the approach below, but it didn't work:

text = "ActionAction-AdventureShooterStealth"
res = text.split(',')
print(res)


There are 3 best solutions below

0
On

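One way to do this is with re: match a capitalized word ([A-Z][a-z]+) followed by zero or more occurrences of -[A-Z][a-z]+, i.e. a hyphen and another capitalized word. The group is written as non-capturing, (?:...), so that re.findall returns the whole matches rather than the group contents.

import re

string = "ActionAction-AdventureShooterStealth"
# A capitalized word, optionally extended by hyphenated capitalized words
pattern = r"[A-Z][a-z]+(?:-[A-Z][a-z]+)*"
string_list = re.findall(pattern, string)
print(string_list)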

Output:

['Action', 'Action-Adventure', 'Shooter', 'Stealth']
2
On

text.split(',') doesn't work, because text doesn't have any , separators in it. In fact the patterns are not separated, so splitting cannot work here. You need to extract matching patterns.
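To see this concretely, here is a minimal sketch using the string from the question. When the separator never occurs, split simply returns the whole string as the only element of the list:

```python
text = "ActionAction-AdventureShooterStealth"

# str.split only cuts where the separator occurs; there is no ',' in text,
# so the result is a one-element list holding the original string.
parts = text.split(',')
print(parts)  # ['ActionAction-AdventureShooterStealth']
```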

A suitable regex pattern for a genre in the simplest case seems to be an uppercase letter followed by one or more lowercase letters: [A-Z][a-z]+.

A suitable regex pattern for genres that are composed of two simple genre names like the one earlier, separated by - could be written as: [A-Z][a-z]+-[A-Z][a-z]+

To combine the two cases in one, we can make the second genre name optional by enclosing it in (...) and adding a ?, which means "zero or one": [A-Z][a-z]+(-[A-Z][a-z]+)?

If you want to support more than two names with - separator, you could change the ? to * to mean "zero or more": [A-Z][a-z]+(-[A-Z][a-z]+)*
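The difference between ? and * can be sketched on a made-up input with a three-part hyphenated name (the sample string below is hypothetical, not from the question). The sketch uses re.finditer and group(0), which always gives the whole match, so the patterns can be used exactly as written above:

```python
import re

# Hypothetical input with a three-part hyphenated name, for illustration only.
sample = "Action-Adventure-ShooterStealth"

optional = r"[A-Z][a-z]+(-[A-Z][a-z]+)?"  # at most one hyphenated part
repeated = r"[A-Z][a-z]+(-[A-Z][a-z]+)*"  # any number of hyphenated parts

# group(0) is the entire match, regardless of any groups in the pattern.
with_optional = [m.group(0) for m in re.finditer(optional, sample)]
with_repeated = [m.group(0) for m in re.finditer(repeated, sample)]

print(with_optional)  # ['Action-Adventure', 'Shooter', 'Stealth']
print(with_repeated)  # ['Action-Adventure-Shooter', 'Stealth']
```

With ?, the third part of the hyphenated name is split off as a separate word; with *, the whole name stays together.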

You can use the re package in Python to match regex patterns, and the findall function to extract all matching patterns.

>>> import re
>>> re.findall(r'[A-Z][a-z]+(?:-[A-Z][a-z]+)*', 'ActionAction-AdventureShooterStealth')
['Action', 'Action-Adventure', 'Shooter', 'Stealth']

Note that:

  • Regex patterns should be written as raw strings r'...' instead of regular '...' strings
  • (...) in a regex is a capturing group. When the pattern contains capturing groups, the findall function returns only the captured groups instead of the entire matches. That's not what we want here, so we must write (?:...) instead of (...) to make it a non-capturing group.
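The effect of the second note can be checked with a small comparison, sketched here on the string from the question. The only difference between the two patterns is (...) versus (?:...):

```python
import re

text = "ActionAction-AdventureShooterStealth"

# Capturing group: findall returns what the group captured last in each
# match, or '' when the group did not take part in the match.
capturing = re.findall(r"[A-Z][a-z]+(-[A-Z][a-z]+)*", text)

# Non-capturing group: findall returns the entire matches.
non_capturing = re.findall(r"[A-Z][a-z]+(?:-[A-Z][a-z]+)*", text)

print(capturing)      # ['', '-Adventure', '', '']
print(non_capturing)  # ['Action', 'Action-Adventure', 'Shooter', 'Stealth']
```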

If you will match the same pattern against multiple strings, you can compile it once and reuse the compiled object. For example:

import re

pattern = re.compile(r'[A-Z][a-z]+(?:-[A-Z][a-z]+)*')
pattern.findall('ActionAction-AdventureShooterStealth')
# returns: ['Action', 'Action-Adventure', 'Shooter', 'Stealth']

pattern.findall('ComedyAction')
# returns: ['Comedy', 'Action']

pattern.findall('Action-ShooterStealth')
# returns: ['Action-Shooter', 'Stealth']
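For the original task of extracting only the major genre, the first match is enough. A minimal sketch, assuming the rows from the table in the question are held in a plain dict (the names genre_pattern, games and major_genre are illustrative, not from the question):

```python
import re

genre_pattern = re.compile(r"[A-Z][a-z]+(?:-[A-Z][a-z]+)*")

# Sample rows from the table in the question (game -> concatenated genres).
games = {
    "A": "ActionComedyAdventure",
    "B": "AdventureComedy",
    "C": "NarrationShooting",
}

def major_genre(genres):
    """Return the first genre in the string, or None if nothing matches."""
    # match() anchors at the start of the string, where the major genre is.
    match = genre_pattern.match(genres)
    return match.group(0) if match else None

major = {game: major_genre(genres) for game, genres in games.items()}
print(major)  # {'A': 'Action', 'B': 'Adventure', 'C': 'Narration'}
```

If the data is in a pandas DataFrame, the same helper could be applied to the genres column, for example with Series.map, to fill the new "Major Genre" column.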
1
On

The easiest approach is to use indexing: use Python string formatting to separate the pieces with ',' and get the output as a list, either by calling list() or by writing the items directly in [].