I have some grouped data where each group looks like this
(2023-02-27,
{(2023-02-27T00:00:00.000Z,2023-02-27,6736,358,2,887,92477,6736,9307,19.44),
(2023-02-27T00:00:00.000Z,2023-02-27,8765,348,3,27,93970,8765,5156,3.4),
(2023-02-27T00:00:00.000Z,2023-02-27,8765,348,3,27,85001,8765,4841,0.34),
(2023-02-27T00:00:00.000Z,2023-02-27,9010,497,2,26,86009,9010,318,13.11),
(2023-02-27T00:00:00.000Z,2023-02-27,1348,819,3,833,87511,1348,10020,17.57),
(2023-02-27T00:00:00.000Z,2023-02-27,9642,261,2,221,95144,9642,5983,1.61),
(2023-02-27T00:00:00.000Z,2023-02-27,9642,261,2,221,84270,9642,5175,0.3),
(2023-02-27T00:00:00.000Z,2023-02-27,9642,261,2,221,92821,9642,8775,12.17),
(2023-02-27T00:00:00.000Z,2023-02-27,8231,26,3,860,81634,8231,5376,10.69),
(2023-02-27T00:00:00.000Z,2023-02-27,2267,1361,2,86,88583,2267,9738,0.27),
(2023-02-27T00:00:00.000Z,2023-02-27,6736,358,2,887,85082,6736,3350,3.53),
(2023-02-27T00:00:00.000Z,2023-02-27,312,184,2,793,93376,312,3098,16.92),
(2023-02-27T00:00:00.000Z,2023-02-27,312,184,2,793,83578,312,1441,2.24)})
--The description of the above data
grouped_data: {group: chararray,joined_data: {(
formatted_order_data::order_date: datetime,
formatted_order_data::order_day: chararray,
formatted_order_data::order_id: int,
formatted_order_data::customer_id: int,
formatted_order_data::shipping_method_id: int,
formatted_order_data::dest_address_id: int,
order_line_data::line_id: chararray,
order_line_data::order_id: int,
order_line_data::book_id: int,
order_line_data::price: float)}}
I'd like to count the number of book_id, count the distinct number of order_id and sum the price for each group. How to acheive this?
The expected result would be:
order_day num_orders num_books total_price
(2023-02-27, 8, 13, 101.59)
I've tried it like this but it doesn't work. It seems there is something wrong with referencing to oined_data.order_id and joined_data.price. How should I modify the script?
result = FOREACH grouped_data GENERATE
group AS order_day,
COUNT(joined_data) AS num_books, -- Count the rows in the joined_data bag
COUNT(DISTINCT joined_data.order_id) AS num_orders,
SUM(joined_data.price) AS total_price;