Please do not block me for this question: I have been looking for the answer for about a month without success, and you are my last hope (if you want to report it, please answer me first and then report it, thanks). I wrote a hybrid text classification program in MATLAB and it works correctly, but now I do not know how to evaluate the results. I do not understand the training set and test set in Reuters-21578. My code finds the keywords in a text and, with the help of a hybrid KNN algorithm, puts the text into the correct class, but I do not know what the candidate classes are. Should I create them myself, or are they already provided? If each .sgm file in Reuters-21578 is a class, how can I use it as a candidate class? The files are full of words, so should I classify them first to arrive at the chosen classes that other documents can then be classified against?
An evaluation of text classification method with Reuters-21578 dataset
844 Views Asked by deansam At
There are 2 best solutions below
Hima Varsha
I have been through the same problem. If the exact version of the Reuters dataset doesn't matter to you, it is also available in nltk.corpus, where you can easily access the training documents, the test documents, and their respective categories. You do not have to worry about extracting them from the .sgm files.
You can do this:
from nltk.corpus import reuters  # run nltk.download("reuters") once beforehand
# All file IDs, e.g. "training/9865" or "test/14826"
documents = reuters.fileids()
# Keep only the training and test documents (lists, so they can be reused and indexed)
train_docs = [doc for doc in documents if doc.startswith("training/")]
test_docs = [doc for doc in documents if doc.startswith("test/")]
# Raw text of a document
data = reuters.raw(documents[0])
# Categories (the classes, in your case) of a document; a list, since a document can have several
category = reuters.categories(documents[0])
Now you can use these to train and test. In a nutshell, train_docs and test_docs hold the IDs of the training and test documents, and the raw content and categories of each document can be retrieved with the methods above.
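To evaluate the results, one common option is to compare the categories your classifier predicts for each test document with the ones returned by reuters.categories(doc), and compute micro-averaged precision, recall and F1. The following is only a minimal sketch in plain Python: micro_prf and the toy label lists are hypothetical names and made-up data for illustration, not part of NLTK or of the original MATLAB code.

```python
def micro_prf(true_labels, predicted_labels):
    """Micro-averaged precision, recall and F1 for multi-label output.

    Both arguments are lists of label collections, one entry per document.
    """
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        truth, predicted = set(truth), set(predicted)
        tp += len(truth & predicted)   # labels predicted correctly
        fp += len(predicted - truth)   # labels predicted but wrong
        fn += len(truth - predicted)   # labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example with made-up labels for three test documents
gold = [["earn"], ["grain", "wheat"], ["crude"]]
pred = [["earn"], ["grain"], ["trade"]]
precision, recall, f1 = micro_prf(gold, pred)
```

With the real corpus, gold would be built from reuters.categories(doc) for each doc in test_docs, and pred from your classifier's output for the same documents in the same order.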
The topic tags of each article can be used as its class labels. You can split the stories that have topics into a training set and a test set to evaluate your classifier. Reuters-21578 also contains stories without any topics; you can use your classifier to assign class labels to these.
Note: Many stories have multiple topics, so a story can belong to more than one class.
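The split described above could be sketched as follows. This is a hedged illustration only: split_stories, the 25% test fraction and the toy stories are assumptions made for the example, and it presumes the stories have already been parsed into (text, topics) pairs.

```python
import random


def split_stories(stories, test_fraction=0.3, seed=0):
    """Split (text, topics) pairs into train, test and unlabeled lists."""
    labeled = [s for s in stories if s[1]]        # stories with at least one topic
    unlabeled = [s for s in stories if not s[1]]  # stories without topics
    rng = random.Random(seed)                     # fixed seed for a repeatable split
    rng.shuffle(labeled)
    n_test = int(len(labeled) * test_fraction)
    return labeled[n_test:], labeled[:n_test], unlabeled


# Made-up stories; the second one has two topics, the fourth has none
stories = [
    ("Profit rose in the first quarter", ["earn"]),
    ("Wheat exports fell last month", ["grain", "wheat"]),
    ("Oil prices climbed again", ["crude"]),
    ("Company announces annual meeting", []),
    ("Dividend declared for shareholders", ["earn"]),
]
train, test, unlabeled = split_stories(stories, test_fraction=0.25)
```

The classifier is trained on train, scored on test, and can then be applied to unlabeled. Because a story may carry several topics, each label is kept as a list rather than a single value.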