Remove part of a string in each row of a large column of data in KNIME


I am stumped.

I have a column with some thousand rows of unique addresses of universities, pharma companies, etc. in a KNIME workflow.

Example: 55 Shattuck Street Boston Massachusetts 02115 US [NAT: US RES: US] for all designated states

What I need is to clean the data so that each row looks nice and computable, like this: 55 Shattuck Street Boston Massachusetts 02115 US.

My problem is that I can't seem to get the system to remove everything after US. Does anyone know a suitable approach in KNIME?


There are 2 best solutions below

Best answer

You should be able to use either String Replacer or String Manipulation for this. The first lets you use either a simple wildcard or a full regular expression pattern, while the second uses a Java-like syntax. The choice comes down to how many different variations of the input data you need to handle and which syntax you prefer.

If you just need to remove any text between square brackets, including the space before the opening bracket, you can use String Replacer configured like this:

[Image: String Replacer configuration dialog]
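Since the screenshot does not show the pattern in text form, here is a rough sketch of the two regular expressions that could apply, written as standalone Python rather than as KNIME node settings. The function names and both patterns are assumptions for illustration, not the answerer's actual configuration. The first removes only the bracketed part, as described above; the second removes everything from the opening bracket to the end of the row, which is what the question's example needs because text follows the closing bracket.

```python
import re

# Sample row taken from the question.
ROW = ("55 Shattuck Street Boston Massachusetts 02115 US "
       "[NAT: US RES: US] for all designated states")


def strip_bracketed(text: str) -> str:
    """Remove each [...] group together with the whitespace before it."""
    return re.sub(r"\s*\[[^\]]*\]", "", text)


def strip_from_bracket(text: str) -> str:
    """Remove everything from the first '[' (and the whitespace before it) to the end."""
    return re.sub(r"\s*\[.*$", "", text)


if __name__ == "__main__":
    print(strip_bracketed(ROW))
    print(strip_from_bracket(ROW))
```

Rows that contain no square bracket pass through both functions unchanged. The same patterns should carry over to a regex-based replacement in KNIME with an empty replacement string, but check the result on a few rows first.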

1
On

Besides the nodes already mentioned in nekomatic's answer, which will work perfectly for the given scenario, there is also a user-friendly regular expression tool in the Palladian nodes extension called Regex Extractor. It lets you build your regexes with a live preview, as you might know from popular online regex testers.

For your scenario, you could e.g. set up a regex like this:

^(?<address>.*)(?:\s\[.*)

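In prose, this means: capture all characters up to a space followed by an opening square bracket, and output them into a column named address. Because .* is greedy, the capture runs up to the last such bracket in the row if there is more than one.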
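To try the pattern outside KNIME, a minimal Python sketch could look like the following. Note that Python writes named groups as `(?P<address>...)` instead of the Java-style `(?<address>...)` used above; the function name `extract_address` and the choice to return `None` for rows without a bracket are assumptions made for this example, not behavior of the Regex Extractor node.

```python
import re
from typing import Optional

# Same pattern as above, with Python's named-group syntax.
ADDRESS_PATTERN = re.compile(r"^(?P<address>.*)(?:\s\[.*)")


def extract_address(text: str) -> Optional[str]:
    """Return the text before the last ' [' in the row, or None if there is no match."""
    match = ADDRESS_PATTERN.match(text)
    return match.group("address") if match else None


if __name__ == "__main__":
    row = ("55 Shattuck Street Boston Massachusetts 02115 US "
           "[NAT: US RES: US] for all designated states")
    print(extract_address(row))
```

A row without any square bracket does not match at all, so in this sketch it yields `None`; decide separately whether such rows should be kept as they are.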

The Palladian extension is available here as a free plugin for KNIME Desktop and provides a variety of different tools for web, text, and geo data mining and classification.
