I want to delete all rows of a dataframe that match one or more rows in a filtering dataframe.
Yes, I know about filtering by a single regex, and I also know how a join can be leveraged when there is a full match on a column. But this isn't a direct match, except by looping over the filter dataframe row by row.
In SQL this filter is relatively trivial to apply in bulk, on the server, without looping in client-side code:
Given:
data.csv
filename,col2
keep.txt,bar
skip.txt,foo
keep2.txt,zoom
skip3.txt,custom1
discard.txt,custom2
file3.txt,custom3
discard2.txt,custom4
file4.txt,custom5
filter.csv:
skip
discard
skip
Here's the SQL, using Postgres. It will scale very well, and that is the key point here.
withsql.sql
\c test;
DROP TABLE IF EXISTS data;
DROP TABLE IF EXISTS filter;
CREATE TABLE data (
filename CHARACTER(50),
col2 CHARACTER(10),
skip BOOLEAN DEFAULT FALSE
);
\copy data (filename,col2) FROM './data.csv' WITH (FORMAT CSV);
CREATE TABLE filter (
skip VARCHAR(20)
);
\copy filter FROM './filter.csv' WITH (FORMAT CSV);
update filter set skip = skip || '%';
update data set skip = TRUE where exists (select 1 from filter s where filename like s.skip);
delete from data where skip = TRUE;
select * from data;
psql -f withsql.sql
This gives as output:
You are now connected to database "test" as user "djuser".
...
UPDATE 4
DELETE 4
 filename  | col2    | skip
-----------+---------+------
 filename  | col2    | f
 keep.txt  | bar     | f
 keep2.txt | zoom    | f
 file3.txt | custom3 | f
 file4.txt | custom5 | f
(5 rows)
Now, I can do this with Polars too, but the only approach I can think of is a loop over filter.csv:
withpolars.py
import polars as pl
df_data = pl.read_csv("data.csv")
df_filter = pl.read_csv("filter.csv")
for row in df_filter.iter_rows():
    df_data = df_data.filter(~pl.col('filename').str.contains(row[0]))
print("data after:\n", df_data)
The output is correct, but can I do this without looping, somehow? And... I'm just curious how some of these bulk SQL approaches map to dataframes.
data after:
shape: (4, 2)
┌───────────┬─────────┐
│ filename ┆ col2 │
│ --- ┆ --- │
│ str ┆ str │
╞═══════════╪═════════╡
│ keep.txt ┆ bar │
│ keep2.txt ┆ zoom │
│ file3.txt ┆ custom3 │
│ file4.txt ┆ custom5 │
└───────────┴─────────┘
There is a dedicated partial/substring matching function: .str.contains_any()

As for your current approach, I think the issue is calling .filter each time. Instead, you would build a single expression and call it once (allowing Polars to parallelize the work).