Questions tagged [natural-language]
Natural Language Processing is a set of techniques from linguistics, artificial intelligence, machine learning and statistics that aim to process and understand human language.
1,143 questions
1
vote
0
answers
63
views
In the original InstructGPT paper, why is the loss divided by K choose 2?
In the original InstructGPT paper, the loss of the reward model is as follows:
Why do the authors divide by $\binom{K}{2}$? If, for example, we have $7$ prompts and $5$ completions per prompt, the ...
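The formula itself does not survive in this excerpt; for reference, the pairwise reward-model loss in the InstructGPT paper has the form
$$\operatorname{loss}(\theta) = -\frac{1}{\binom{K}{2}}\, E_{(x,\,y_w,\,y_l)\sim D}\Bigl[\log\bigl(\sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr)\bigr)\Bigr],$$
where the $K$ completions per prompt yield $\binom{K}{2}$ ranked pairs, so the factor averages the pairwise loss over all pairs from one prompt; the number of prompts does not enter the normalizer.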
2
votes
2
answers
103
views
Which totals should I use in a Chi-square analysis?
I'm creating a presentation for some secondary math teachers. I want them to see how AI's ability to write code opens up a lot more data and analysis opportunities for them. For my example, I'm using ...
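On the totals question, the practical point is that the test takes only the observed cell counts; row, column, and grand totals are derived internally to build the expected counts. A minimal sketch with made-up classroom data:

```python
# scipy.stats.chi2_contingency takes only the observed cell counts;
# row/column/grand totals are computed internally for the expected counts,
# so totals should never be entered as if they were cells themselves.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10],    # hypothetical: group A pass / fail
                     [20, 25]])   # hypothetical: group B pass / fail
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)
print(expected)   # expected counts under independence
```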
0
votes
0
answers
75
views
Conditional independence assumption for Naive Bayes with Multinomial distribution
I was going through the Naive Bayes classifier (from the Cornell Machine Learning course, link here) and I was quite confused by the use of the Naive Bayes classifier for bag-of-words with the Multinomial ...
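For reference, the multinomial Naive Bayes scoring rule at issue is
$$P(y \mid d)\;\propto\;P(y)\prod_{w\in V}P(w\mid y)^{c(w,d)},$$
where $c(w,d)$ is the count of word $w$ in document $d$. The conditional independence ("naive") assumption is that, given the class, each word occurrence is drawn independently, which is exactly what makes the bag-of-words counts sufficient for classification.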
3
votes
1
answer
210
views
Why are there 2 different sets of weights in word2vec?
I am studying from here https://medium.com/@zafaralibagh6/a-simple-word2vec-tutorial-61e64e38a6a1
The author talks about 2 sets of weights: in the first hidden layer you have the $W^1$ matrix, and in the ...
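A minimal numpy sketch of the two matrices in skip-gram (the names $W^1$/$W^2$ follow the tutorial; the rest is my own illustration):

```python
# W1 maps the one-hot center word to its hidden vector; W2 scores every
# vocabulary word as a context candidate. They play different roles
# (input vs. output representations), hence two separate matrices.
# The rows of W1 are what is usually kept as the "word embeddings".
import numpy as np

V, d = 10_000, 300                            # vocabulary size, embedding dim
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, d))       # input (center-word) embeddings
W2 = rng.normal(scale=0.1, size=(d, V))       # output (context-word) weights

center = 42                                   # index of a center word
h = W1[center]                                # "hidden layer" = row lookup
scores = h @ W2                               # one score per context candidate
probs = np.exp(scores - scores.max())
probs /= probs.sum()                          # softmax over the vocabulary
```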
2
votes
0
answers
42
views
How to conduct A/B testing for AI models properly with a limited dataset (NLP)
Situation:
I want to compare the performance of two models on the same task. I have a dataset of around 400 manually curated samples. The task is relatively niche (targeted sentiment analysis on ...
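With only ~400 samples, one defensible option is a paired comparison on the same items, e.g. a paired bootstrap over per-sample correctness (the arrays below are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
correct_a = rng.integers(0, 2, size=400)   # stand-in: 1 if model A was right
correct_b = rng.integers(0, 2, size=400)   # stand-in: 1 if model B was right

n, boots = len(correct_a), 10_000
diffs = np.empty(boots)
for i in range(boots):
    idx = rng.integers(0, n, size=n)       # resample the SAME items for both models
    diffs[i] = correct_a[idx].mean() - correct_b[idx].mean()
lo, hi = np.percentile(diffs, [2.5, 97.5]) # 95% CI for the accuracy gap
print(f"accuracy difference 95% CI: [{lo:.3f}, {hi:.3f}]")
```

Because both models are scored on the same resampled items, between-item variance cancels out, which matters far more than the choice of test statistic at this sample size.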
1
vote
1
answer
130
views
Pseudo label as ground truth?
I'm new to machine learning and currently working on new topic discovery and topic modelling under NLP.
If I have unlabeled survey responses that I want to categorise but don't know how, run an NMF ...
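A sketch of the NMF route the question describes (all data hypothetical): vectorize the responses, factorize, and take each response's dominant topic as a pseudo-label. Whether that pseudo-label may then be treated as ground truth for a supervised model is exactly the judgment call being asked about.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

responses = ["the app keeps crashing", "great support team", "way too expensive"]
X = TfidfVectorizer(stop_words="english").fit_transform(responses)

nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)             # document-topic weights
pseudo_labels = W.argmax(axis=1)     # dominant topic per response
```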
1
vote
0
answers
42
views
Parsing maritime location ranges
I'm attempting to train a model to parse maritime location ranges. These are strings that can be resolved into a geographical area or a list of shipping ports.
An example could be ...
0
votes
0
answers
18
views
Why Is My Fine-Tuned RoBERTa (Text classification) Model Only Predicting One Category/Class? [duplicate]
(EDIT: Note my question is not about 'accuracy'/F1 as a measure of precision, but rather why we can't get the test prediction script to work and how to merge the LoRA adapter back into the RoBERTa ...
0
votes
0
answers
55
views
Calculating Precision and Recall in Spell Correction when the input sentence has no errors
I am doing a project on spell correction. While evaluating the model results, I came across this situation: the input sentence has no errors, and the model outputs the input sentence as it is, which ...
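One common framing (my own illustration, not taken from the question): count every edit the model makes as a prediction. A sentence with no errors that the model leaves unchanged then contributes only true negatives, so it affects neither precision nor recall:

```python
# Token-level counting: an edit matching the reference is a TP, a wrong or
# spurious edit is an FP, a missed correction is an FN. Unchanged correct
# tokens are TNs and enter neither precision nor recall.
tp = fp = fn = 0

def update(src, gold, hyp):
    """Compare one token: src = input, gold = reference, hyp = model output."""
    global tp, fp, fn
    if hyp != src:                 # the model made an edit
        if hyp == gold:
            tp += 1                # correct edit
        else:
            fp += 1                # wrong or spurious edit
            if gold != src:
                fn += 1            # ...that also missed a needed correction
    elif gold != src:
        fn += 1                    # needed correction the model skipped

for s, g, h in [("teh", "the", "the"), ("cat", "cat", "cat")]:
    update(s, g, h)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
```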
1
vote
0
answers
17
views
Why Is My Fine-Tuned RoBERTa (Text classification) Model Only Predicting One Category/Class? [duplicate]
I’m fine-tuning RoBERTa to classify text into 199 categories (e.g., “acculturation stress,” “cognitive flexibility,” etc.). My dataset has ~15,000 lines of text, each mapped to one of these well-being ...
2
votes
2
answers
164
views
Why don't we mask other layers besides the multi-head attention in transformers?
Typically when training for NLP tasks, we need to pad our sequences to a max_len, so they can be processed efficiently in a batch-wise manner. However, these padded ...
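A minimal PyTorch sketch of the usual answer: masking is needed only where pad tokens could influence other positions, i.e. inside attention. Position-wise layers (FFN, layer norm) act on each token independently, and the loss can simply ignore pad positions:

```python
import torch
import torch.nn as nn

batch, max_len, d_model = 2, 5, 16
x = torch.randn(batch, max_len, d_model)
lengths = torch.tensor([5, 3])                                  # 2nd sequence has 2 pads
pad_mask = torch.arange(max_len)[None, :] >= lengths[:, None]   # True at pad positions

attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
out, _ = attn(x, x, x, key_padding_mask=pad_mask)   # pads excluded as attention keys
```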
3
votes
1
answer
106
views
Probability Distribution Underlying Ngram Model
Texts introducing ngram models often directly manipulate conditional probabilities. For example, given a corpus $V$ with a bigram model on its words, we would compute the probability of a sentence $...
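The excerpt's formula is cut off; the standard bigram factorization it is presumably building toward, with MLE estimates from corpus counts, is
$$P(w_1,\dots,w_n)\;\approx\;\prod_{i=1}^{n}P(w_i\mid w_{i-1}),\qquad \hat P(w_i\mid w_{i-1})=\frac{\operatorname{count}(w_{i-1}\,w_i)}{\operatorname{count}(w_{i-1})},$$
with $w_0$ a start-of-sentence symbol; the underlying distribution is thus a first-order Markov chain over the vocabulary.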
1
vote
0
answers
37
views
Poor Stanza performance on Named Entity Identification
I'm using both spaCy and Stanza to identify named entities in very short strings (brand names and business names):
...
0
votes
0
answers
94
views
Regression on SQL query texts. Good ML model architecture
Fast regression on SQL queries. Good ML model architecture.
Our goal is to predict which SQL engine (there are 2 currently) will be faster to execute a given query.
The input is the query text and in ...
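A reasonable baseline before any bespoke architecture (hypothetical data; this frames the task as classifying which engine wins rather than regressing both runtimes):

```python
# Character n-gram TF-IDF over the query text plus a linear classifier.
# SQL keywords, operators, and clause shapes show up well in char n-grams.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

queries = ["SELECT a FROM t WHERE b > 1", "SELECT COUNT(*) FROM t GROUP BY a"]
faster_engine = [0, 1]   # hypothetical labels: 0 = engine A faster, 1 = engine B

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LogisticRegression(max_iter=1000),
)
model.fit(queries, faster_engine)
```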
2
votes
0
answers
40
views
Unsupervised clustering of short texts with covariates
I posted this on the Data Science Stack Exchange and didn't get any responses (that site seems pretty dead). So I'm trying here!
I'm working on a project where I have to categorise short texts. I don'...
1
vote
0
answers
89
views
Measuring Similarity in Embedding Spaces? [closed]
For context, I've been using feature hashing for a rapid text classifier with a very small number of features (2000, it is very small on purpose). I noticed that some of the results were a bit wonky ...
1
vote
1
answer
110
views
NER With Custom Tags, How to Approach
I am building a "field tagger" for documents. Basically, a document, in my case something like a proposal or sales quote, would have a bunch of entities scattered throughout it, and we want ...
4
votes
2
answers
746
views
Why is my randomForest model in R overfitting?
I am trying to train a Random Forest model in R for sentiment analysis. The model works with a tf-idf matrix and learns from it how to classify a review as positive or negative.
Positive ones are ...
0
votes
1
answer
60
views
Find event date given the probabilities of finding an event
I have a set of clinical notes with dates for each patient and an NLP model which gives a score between 0.0 and 1.0 for a certain event being present in the note. Given the scores, what is the best ...
0
votes
1
answer
88
views
Clustering of large text datasets with unknown number of clusters
I have a list of hotel names which may or may not be correct, and with different spellings (such as '&' instead of 'and'). I want to use clustering in order to group the hotels with different ...
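For the "unknown number of clusters" part, one sketch is character n-gram TF-IDF plus agglomerative clustering cut by a distance threshold rather than a fixed cluster count (the threshold below is a made-up starting point; the `metric=` argument requires scikit-learn ≥ 1.2):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

names = ["Grand Hotel & Spa", "Grand Hotel and Spa", "Seaside Inn"]
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(names)

labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.8,   # threshold replaces a fixed k
    metric="cosine", linkage="average",
).fit_predict(X.toarray())                     # needs a dense array
print(labels)                                  # same label => same hotel
```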
1
vote
0
answers
89
views
BERT eval loss increase while performance metrics also increase
I want to fine-tune BERT for Named Entity Recognition (NER). However, when fine-tuning over several epochs on different datasets I get a weird behaviour where the training loss decreases, eval loss ...
2
votes
1
answer
235
views
If a document set is too small for running a topic model, can you simply multiply the document set by a factor of 10 to be able to run the model?
Say I'm using Top2Vec as a topic model to capture the top 10 salient topics across documents. I have an array that contains the documents of the corpus. Initially, there are not enough documents to ...
0
votes
0
answers
411
views
BERT used for generative AI
I have a question about using BERT as a generative model. I know BERT can be used for classification or fine-tuned for question answering. However, is it possible to use BERT to generate ...
1
vote
0
answers
51
views
Encoder-decoder Transformer model predicts outputs almost perfectly but fails to decode autoregressively
The model's sample predictions that I'm printing during training are almost perfect but the model generates meaningless tokens during evaluation.
For training I'm feeding it the source and target ...
0
votes
0
answers
389
views
Regression with text data
My goal is to create a regression model with text data where encoded text predicts a value, (news headlines, or article summaries, predicting number of clicks). The y is very left-skewed (few articles ...
0
votes
0
answers
109
views
Classification in BERT - why not use class as a feature?
I am currently following this post, which details how BERT was trained. I had a few questions about the classification task:
In the post, it mentions that the authors of BERT decided to add ...
2
votes
0
answers
102
views
Can I calculate the significance of the number of deponent verbs with a certain feature like this?
In a language like Ancient Greek, verbal forms are marked for voice (active/middle/passive). Deponent verbs are verbs that exist only in the middle (or passive) voice, but appear to have an active ...
2
votes
1
answer
139
views
Gradient Clipping of Vanilla RNNs vs LSTMs
I am doing an online course that states that the reason we use LSTMs and similar variants of vanilla RNNs is the vanishing/exploding gradient problem of vanilla RNNs.
However, an ...
0
votes
1
answer
125
views
Why is the WordPiece algorithm implemented according to the maximum mutual information?
WordPiece is a subword segmentation algorithm in the field of natural language processing. Unlike BPE, WordPiece selects the pair with the largest mutual information to merge at each step, and ...
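For reference, the merge score usually attributed to WordPiece is
$$\operatorname{score}(x,y)=\frac{\operatorname{count}(xy)}{\operatorname{count}(x)\,\operatorname{count}(y)},$$
a count-level analogue of pointwise mutual information $\log\frac{p(xy)}{p(x)\,p(y)}$: picking the highest-scoring pair merges the two units whose co-occurrence is most surprising under independence, whereas BPE merges the most frequent pair regardless of how frequent its parts are on their own.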
1
vote
0
answers
75
views
Does the skipgram model use backpropagation?
I just started to get interested in natural language processing and I was trying to understand the skipgram model from word2vec. I was reading this interesting website. However, in the mentioned ...
3
votes
3
answers
622
views
Countering t-test "any feature is significant" results for large sample size datasets
I'm doing some analysis over natural language data, which basically entails:
Computing some feature over all samples.
Evaluating if this feature statistically significantly discriminates between ...
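A standard counter to "everything is significant at large n" is to report an effect size alongside the p-value; a minimal Cohen's d sketch on simulated feature values:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.00, 1.0, size=500_000)   # simulated feature values, group 1
b = rng.normal(0.01, 1.0, size=500_000)   # simulated feature values, group 2

pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (a.mean() - b.mean()) / pooled_sd     # t-test is likely "significant",
print(d)                                  # but |d| ~ 0.01 is negligible
```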
0
votes
0
answers
74
views
How is metadata represented in sentiment analysis?
There are papers on semantic analysis using metadata such as "Sentiment Classification on Steam Reviews" (https://cs229.stanford.edu/proj2017/final-reports/5244171.pdf) and "Detecting ...
0
votes
1
answer
1k
views
What is the Llama2 number of steps? [closed]
Llama2 is pretrained on 2 trillion tokens ($2\times10^{12}$), and its batch size is $4\times 10^6$ tokens.
We can calculate the number of steps (times we update the parameters) per epoch as follows:
$$\...
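The equation is cut off in the excerpt; with the corrected token count, the arithmetic it sets up is presumably
$$\text{steps per epoch}=\frac{\text{training tokens}}{\text{tokens per batch}}=\frac{2\times10^{12}}{4\times10^{6}}=5\times10^{5}.$$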
2
votes
0
answers
80
views
Does it make sense to perform Domain Adaptation before Transfer Learning?
Suppose I would like to do extractive question answering on scientific literature. I'm interested in using BERT, which was pretrained on Wikipedia and BookCorpus. I see two routes here:
1. Fine-tune BERT on ...
0
votes
1
answer
81
views
Considering weights right of the embeddings layer aren't used in Doc2Vec/Word2Vec, is the informative capacity of the embeddings not strongly reduced?
In an extreme (and probably impossible) example, could you not end up with all the power for the prediction being contained in the weights to the right of the embeddings layer?...and thus the matrix ...
1
vote
1
answer
183
views
How are vector values assigned initially in Word2Vec and how are they changed with iterations of the algorithm?
I am new to NLP and I'm not fully grasping how word2vec works. I understand that it aims to predict a word given its context or a context given a word but I'm not sure how the initial vector values ...
0
votes
1
answer
688
views
Using Word Embeddings in Clustering and Topic Modelling
I am new to the field of NLP and would appreciate any guidance please. I am trying to understand how word embeddings can be used in clustering and topic modelling. If I create word embeddings for ...
1
vote
1
answer
302
views
How does training word embeddings bring similar words closer together?
How does training of word embeddings lead to the clustering of similar words in the embedding space? What causes that effect?
1
vote
0
answers
277
views
How to determine EC2 instance type and memory for LLM inference endpoint [closed]
I am trying to estimate the costs required for hosting a fine-tuned large language model for real-time inference. There will be hundreds of users querying the endpoint concurrently for multiple use cases ...
1
vote
0
answers
63
views
Machine learning and Natural Language Processing Algorithms for Indian Surnames Homophones [closed]
Homophones
Indian Surnames List
English last names
Can machine learning, natural language processing (NLP), and artificial intelligence assist in classifying, interpreting, and specifying the differences ...
0
votes
1
answer
75
views
Creating a morphology tagging/labeling model
I had an idea of building a model using machine learning or deep learning in order to perform morphological tagging/labeling on untagged/unlabeled data.
I have a lot of tagged/labeled data (about 30,...
1
vote
2
answers
263
views
Is concatenating a single integer sufficient for encoding positional embeddings in transformer models?
In transformer models, positional embeddings are commonly used to encode the positional information of words in a sequence. While sinusoidal positional embeddings are often employed, I'm curious about ...
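For comparison with the single-integer idea, a minimal sketch of the standard sinusoidal encoding: position is spread across many bounded dimensions, so no coordinate grows with sequence length the way a raw concatenated integer would:

```python
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal positional encoding (d_model must be even)."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angles = pos / np.power(10_000, i / d_model)       # geometric frequency ladder
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(max_len=128, d_model=64)            # all values stay in [-1, 1]
```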
1
vote
1
answer
133
views
Input non-sequential data of arbitrary size to network
I have a case where I want to feed a network with polylines of data. The problem is that the input can be any number of polylines and the polylines can consist of any number of points. If we instead ...
3
votes
0
answers
533
views
Attention is All You Need: How to calculate params number of the models?
I want to re-calculate the last column of Table 3 of Attention is All You Need, i.e. the number of params in the models, but the numbers from my calculation do not match.
Model | Params from Table 3 ($\times 10^...
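A rough back-of-the-envelope for the base model (my own sketch: d_model = 512, d_ff = 2048, 6+6 layers, shared embeddings over a vocabulary on the order of 37k BPE types; biases and layer norms ignored):

```python
d, d_ff, N, vocab = 512, 2048, 6, 37_000

attn = 4 * d * d                 # W_q, W_k, W_v, W_o projections
ffn = 2 * d * d_ff               # two position-wise linear layers
enc = N * (attn + ffn)           # encoder stack
dec = N * (2 * attn + ffn)       # decoder adds a cross-attention block
emb = vocab * d                  # shared input/output embedding matrix
print((enc + dec + emb) / 1e6)   # ~63M, close to the paper's ~65M for "base"
```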
1
vote
0
answers
222
views
Sentiment Analysis with Continuous Output Labels
Problem Setting/Context:
I have feedback (each piece of feedback has multiple sentences) associated with different products (you can safely assume that each piece of feedback talks about one single product), and I need to ...
1
vote
0
answers
41
views
Accounting for edge cases without training on the test set
I'm fine-tuning a large language model to predict binary sentiment, where a false negative is far more costly for my use case than a false positive. I've used weighted cross-entropy to account for ...
1
vote
2
answers
540
views
Why does the Transformer model not require negative sampling while word2vec does?
Both word2vec and the Transformer model compute a softmax over the words/tokens on the output side.
For word2vec models, negative sampling is used for computational reasons:
Is negative sampling ...
2
votes
0
answers
123
views
Why does the best performing adapter-based parameter-efficient fine-tuning depend on the language model being fine-tuned?
https://arxiv.org/abs/2304.01933 shows that the best performing adapter-based parameter-efficient fine-tuning depends on the language model being fine-tuned:
E.g., LoRA is the best adapter for LLaMA-...
1
vote
1
answer
57
views
Best Way to do Hyperparam Search and Cross-Validation
I'm running experiments to evaluate language models on Brazilian Portuguese datasets.
I've divided each dataset into 10 parts, and I want to use cross-validation to determine the model's ...
0
votes
1
answer
289
views
Estimating exponent of Zipf distribution using MLE vs fitting linear regression on log-transformed rank and frequency data
I'm having trouble understanding why I get radically different results if I try to find the parameter of a Zipf distribution when I use the methods proposed by Clauset et al. (2009) as opposed to ...
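A sketch contrasting the two estimators on the same rank-frequency data (the counts below are made up): OLS on the log-log plot weights every rank equally, while an MLE in the spirit of Clauset et al. maximizes the likelihood of the observed tokens, so the two routinely disagree:

```python
import numpy as np
from scipy.optimize import minimize_scalar

freqs = np.array([1000, 480, 310, 220, 90, 40, 12, 5])   # made-up word counts
ranks = np.arange(1, len(freqs) + 1)

# 1) OLS slope on log-log rank-frequency: frequency ~ rank^(-s)
s_ols = -np.polyfit(np.log(ranks), np.log(freqs), 1)[0]

# 2) MLE for a Zipf law over the observed ranks: P(r) = r^(-s) / Z(s)
def neg_loglik(s):
    logZ = np.log(np.sum(ranks ** (-s)))          # finite-support normalizer
    return -np.sum(freqs * (-s * np.log(ranks) - logZ))

s_mle = minimize_scalar(neg_loglik, bounds=(0.1, 10), method="bounded").x
print(s_ols, s_mle)   # typically noticeably different on real corpora
```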