I'm a beginner on pyflink framework and I would like to know if my use case is possible with it ...
I need to make a tumbling windows and apply a python udf (scikit learn clustering model) on it. The use case is : every 30 seconds I want to apply my udf on the previous 30 seconds of data.
For the moment I succeeded in consume data from a kafka in streaming but then I'm not able to create a 30seconds window on a non-keyed stream with the python API.
Do you know some example for my use case ? Do you know if the pyflink API allow this ?
Here my first shot :
from pyflink.common import Row
from pyflink.common.serialization import JsonRowDeserializationSchema, JsonRowSerializationSchema
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer, FlinkKafkaProducer
from pyflink.common.watermark_strategy import TimestampAssigner, WatermarkStrategy
from pyflink.common import Duration
import time
from utils.selector import Selector
from utils.timestampAssigner import KafkaRowTimestampAssigner
# 1. create a StreamExecutionEnvironment
env = StreamExecutionEnvironment.get_execution_environment()
# the sql connector for kafka is used here as it's a fat jar and could avoid dependency issues
env.add_jars("file:///flink-sql-connector-kafka_2.11-1.14.0.jar")
deserialization_schema = JsonRowDeserializationSchema.builder() \
.type_info(type_info=Types.ROW_NAMED(["labelId","freq","timestamp"],[Types.STRING(),Types.DOUBLE(),Types.STRING()])).build()
kafka_consumer = FlinkKafkaConsumer(
topics='events',
deserialization_schema=deserialization_schema,
properties={'bootstrap.servers': 'localhost:9092'})
# watermark_strategy = WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(5))\
# .with_timestamp_assigner(KafkaRowTimestampAssigner())
ds = env.add_source(kafka_consumer)
ds.print()
ds = ds.windowAll()
# ds.print()
env.execute()
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.flink.api.java.ClosureCleaner (file:/home/dorian/dataScience/pyflink/pyflink_env/lib/python3.6/site-packages/pyflink/lib/flink-dist_2.11-1.14.0.jar) to field java.util.Properties.serialVersionUID
WARNING: Please consider reporting this to the maintainers of org.apache.flink.api.java.ClosureCleaner
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Traceback (most recent call last):
File "/home/dorian/dataScience/pyflink/project/__main__.py", line 35, in <module>
ds = ds.windowAll()
AttributeError: 'DataStream' object has no attribute 'windowAll'
Thx