Hi, I'm trying to generate a small synthetic dataset for association rule analysis using the Apriori algorithm, and I'm wondering whether it matters if the same item appears more than once in a transaction. For example: [milk, milk, banana, yogurt], [apple, meat, soap, apple]. Thanks for reading!
Here's what I've coded so far.
Generate a random sample (weighted):
products = ['홈','어웨이','마킹','입장용트랙탑(블랙)','레인자켓(블랙)','레인자켓(레드)','패딩수트상의(블랙)','패딩수트상의(레드)','선수단롱다운(블랙)','패딩베스트(블랙)','이동복상의(블랙)','이동복상의(블랙)',
            '트레이닝상의(블랙)','트레이닝상의(레드)','바람막이피스테(블랙)','바람막이피스테(레드)','연습복긴팔(블랙)','연습복긴팔(레드)',
            '연습복반팔(블랙)','연습복반팔(레드)','폴로티긴팔(블랙)','폴로티긴팔(레드)','폴로티반팔(블랙)','폴로티반팔(레드)','트레이닝하의(블랙)','3/4팬츠(블랙)',
            '연습복반바지(블랙)','응원용품','FC서울로고니트머플러','FC서울SoulofSeoul니트머플러','FC서울브랜딩니트머플러','FC서울WHITE니트머플러',
            '서울오리지널머플러','기성용캡틴머플러','전사골드머플러','전사블랙머플러','아동유니폼','선수단볼캡블랙','선수단볼캡레드','선수단동계비니','40주년백구','선수단신발주머니','FC서울MINI레인보우',
            'FC서울포토레인보우','유니폼뱃지','엠블럼뱃지','레터링뱃지']  # 47 items in total
prob = [0.2454,0.0966,0.2316,0.0026,0.0031,0.0016,0.0027,0.0001,0.0033,0.0003,0.0021,0.0004,0.0010,0.0017,0.0013,0.0034,0.0011,
        0.0029,0.0030,0.0024,0.0007,0.0009,0.0023,0.0006,0.0014,0.0019,0.0020,0.0368,0.0174,0.0464,0.0116,0.0058,0.0232,
        0.0406,0.0348,0.0291,0.0107,0.0231,0.0069,0.0208,0.0162,0.0046,0.0093,0.0116,0.0185,0.0023,0.0139]
random_data = np.random.choice(products, size=50000,p=prob)
Generate a list of 1,000 customers. I tried to remove the duplicates with set(list_random), but then the sampling no longer follows the weights of the sample above:
store = []
for i in range(1, 1000):
    for j in range(random.randint(1, 4)):
        randsample = random.sample(set(list_random), j)
        store.append(randsample)
#print(store)

df = pd.DataFrame(store)
df.head(10)
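One way to avoid repeated items within a basket while still following the weights might be to draw each basket without replacement directly from the weighted product list, instead of sampling from set(list_random). The following is only a minimal sketch using the standard library. The product names, the weights, and the helper `draw_basket` are hypothetical stand-ins, and the basket size of 1 to 4 is assumed from the `random.randint(1,4)` call above.

```python
import random

# Hypothetical stand-ins for the real `products` and `prob` lists.
products = ["home", "away", "marking", "muffler", "cap"]
prob = [0.40, 0.25, 0.20, 0.10, 0.05]

def draw_basket(items, weights, k, rng):
    """Draw up to k distinct items; each pick is weighted by `weights`."""
    pool = list(zip(items, weights))
    basket = []
    for _ in range(min(k, len(pool))):
        # Weighted pick among the items still left in the pool.
        idx = rng.choices(range(len(pool)), weights=[w for _, w in pool], k=1)[0]
        # Remove the picked item so it cannot appear twice in this basket.
        basket.append(pool.pop(idx)[0])
    return basket

rng = random.Random(0)  # fixed seed so the run is reproducible
# One basket of 1 to 4 distinct items per customer, 1000 customers.
store = [draw_basket(products, prob, rng.randint(1, 4), rng) for _ in range(1000)]
```

With NumPy, `np.random.choice(products, size=k, replace=False, p=prob)` should give a comparable per-basket draw. Either way, a name that is listed twice in `products` (as '이동복상의(블랙)' is above) could still show up twice in one basket.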
(I thought a transaction shouldn't contain the same item more than once for the Apriori algorithm, but when I ran the algorithm on a dataset with repeated items, it still worked.)
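That behavior would be consistent with how Apriori is usually described: each transaction is treated as a set of items, so only presence or absence counts and a repeated item adds nothing to the support. The sketch below illustrates support counting under that assumption. The helper `itemset_support` and the third transaction are hypothetical, added only for illustration, and this is not the code of any particular library.

```python
from collections import Counter
from itertools import combinations

# The two example transactions from the question, plus one hypothetical extra.
transactions = [
    ["milk", "milk", "banana", "yogurt"],
    ["apple", "meat", "soap", "apple"],
    ["milk", "banana"],
]

def itemset_support(transactions, size):
    """Fraction of transactions that contain each itemset of the given size."""
    counts = Counter()
    for t in transactions:
        # Collapse duplicates: only presence of an item matters.
        unique_items = sorted(set(t))
        for combo in combinations(unique_items, size):
            counts[combo] += 1
    n = len(transactions)
    return {combo: c / n for combo, c in counts.items()}

single_support = itemset_support(transactions, 1)
pair_support = itemset_support(transactions, 2)
```

Here "milk" is counted once for the first transaction even though it is listed twice, so its support is 2 out of 3 transactions rather than 3 out of 3.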