Questions tagged [natural-language]
Natural Language Processing is a set of techniques from linguistics, artificial intelligence, machine learning and statistics that aim to process and understand human language.
1,143 questions
1
vote
0
answers
63
views
In the original InstructGPT paper, why is the loss divided by K choose 2?
In the original InstructGPT paper, the loss of the reward model is as follows:
Why do the authors divide by $\binom{K}{2}$? If, for example, we have $7$ prompts and $5$ completions per prompt, the ...
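The formula itself does not survive in this excerpt; for reference, the pairwise reward-model loss in the InstructGPT paper has the form
$$\operatorname{loss}(\theta) = -\frac{1}{\binom{K}{2}}\, E_{(x,\,y_w,\,y_l)\sim D}\Bigl[\log\bigl(\sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr)\bigr)\Bigr],$$
where the $K$ completions per prompt yield $\binom{K}{2}$ ranked pairs, so the factor averages the pairwise loss over all pairs from one prompt; the number of prompts does not enter the normalizer.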
2
votes
2
answers
103
views
Which totals should I use in a Chi-square analysis?
I'm creating a presentation for some secondary math teachers. I want them to see how AI's ability to write code opens up a lot more data and analysis opportunities for them. For my example, I'm using ...
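On the totals question, the practical point is that the test takes only the observed cell counts; row, column, and grand totals are derived internally to build the expected counts. A minimal sketch with made-up classroom data:

```python
# scipy.stats.chi2_contingency takes only the observed cell counts;
# row/column/grand totals are computed internally for the expected counts,
# so totals should never be entered as if they were cells themselves.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10],    # hypothetical: group A pass / fail
                     [20, 25]])   # hypothetical: group B pass / fail
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)
print(expected)   # expected counts under independence
```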
0
votes
0
answers
75
views
Conditional independence assumption for Naive Bayes with Multinomial distribution
I was going through the Naive Bayes classifier (from the Cornell Machine Learning course, link here) and I was quite confused by the use of the Naive Bayes classifier for bag-of-words with the Multinomial ...
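For reference, the multinomial Naive Bayes scoring rule at issue is
$$P(y \mid d)\;\propto\;P(y)\prod_{w\in V}P(w\mid y)^{c(w,d)},$$
where $c(w,d)$ is the count of word $w$ in document $d$. The conditional independence ("naive") assumption is that, given the class, each word occurrence is drawn independently, which is exactly what makes the bag-of-words counts sufficient for classification.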
3
votes
1
answer
210
views
Why are there 2 different sets of weights in word2vec?
I am studying from here https://medium.com/@zafaralibagh6/a-simple-word2vec-tutorial-61e64e38a6a1
The author talks about 2 sets of weights: in the first hidden layer you have the $W^1$ matrix, and in the ...
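A minimal numpy sketch of the two matrices in skip-gram (the names $W^1$/$W^2$ follow the tutorial; the rest is my own illustration):

```python
# W1 maps the one-hot center word to its hidden vector; W2 scores every
# vocabulary word as a context candidate. They play different roles
# (input vs. output representations), hence two separate matrices.
# The rows of W1 are what is usually kept as the "word embeddings".
import numpy as np

V, d = 10_000, 300                            # vocabulary size, embedding dim
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, d))       # input (center-word) embeddings
W2 = rng.normal(scale=0.1, size=(d, V))       # output (context-word) weights

center = 42                                   # index of a center word
h = W1[center]                                # "hidden layer" = row lookup
scores = h @ W2                               # one score per context candidate
probs = np.exp(scores - scores.max())
probs /= probs.sum()                          # softmax over the vocabulary
```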
2
votes
0
answers
42
views
How to conduct A/B testing for AI models properly with a limited dataset (NLP)
Situation:
I want to compare the performance of two models on the same task. I have a dataset of around 400 manually curated samples. The task is relatively niche (targeted sentiment analysis on ...
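With only ~400 samples, one defensible option is a paired comparison on the same items, e.g. a paired bootstrap over per-sample correctness (the arrays below are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
correct_a = rng.integers(0, 2, size=400)   # stand-in: 1 if model A was right
correct_b = rng.integers(0, 2, size=400)   # stand-in: 1 if model B was right

n, boots = len(correct_a), 10_000
diffs = np.empty(boots)
for i in range(boots):
    idx = rng.integers(0, n, size=n)       # resample the SAME items for both models
    diffs[i] = correct_a[idx].mean() - correct_b[idx].mean()
lo, hi = np.percentile(diffs, [2.5, 97.5]) # 95% CI for the accuracy gap
print(f"accuracy difference 95% CI: [{lo:.3f}, {hi:.3f}]")
```

Because both models are scored on the same resampled items, between-item variance cancels out, which matters far more than the choice of test statistic at this sample size.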
1
vote
1
answer
130
views
Pseudo label as ground truth?
I'm new to machine learning and currently working on new topic discovery and topic modelling under NLP.
If I have unlabeled survey responses that I want to categorise but don't know how, run an NMF ...
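A sketch of the NMF route the question describes (all data hypothetical): vectorize the responses, factorize, and take each response's dominant topic as a pseudo-label. Whether that pseudo-label may then be treated as ground truth for a supervised model is exactly the judgment call being asked about.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

responses = ["the app keeps crashing", "great support team", "way too expensive"]
X = TfidfVectorizer(stop_words="english").fit_transform(responses)

nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)             # document-topic weights
pseudo_labels = W.argmax(axis=1)     # dominant topic per response
```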
1
vote
0
answers
42
views
Parsing maritime location ranges
I'm attempting to train a model to parse maritime location ranges. These are strings that can be resolved into a geographical area or a list of shipping ports.
An example could be ...
0
votes
0
answers
18
views
Why Is My Fine-Tuned RoBERTa (Text classification) Model Only Predicting One Category/Class? [duplicate]
(EDIT: Note my question is not about 'accuracy'/F1 as a measure of precision, but rather why we can't get the test prediction script to work and how to merge the LoRA adapter back into the RoBERTa ...
0
votes
0
answers
55
views
Calculating Precision and Recall in Spell Correction when the input sentence has no errors
I am doing a project on spell correction. While evaluating the model results, I came across this situation: the input sentence has no errors, and the model outputs the input sentence as it is, which ...
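One common framing (my own illustration, not taken from the question): count every edit the model makes as a prediction. A sentence with no errors that the model leaves unchanged then contributes only true negatives, so it affects neither precision nor recall:

```python
# Token-level counting: an edit matching the reference is a TP, a wrong or
# spurious edit is an FP, a missed correction is an FN. Unchanged correct
# tokens are TNs and enter neither precision nor recall.
tp = fp = fn = 0

def update(src, gold, hyp):
    """Compare one token: src = input, gold = reference, hyp = model output."""
    global tp, fp, fn
    if hyp != src:                 # the model made an edit
        if hyp == gold:
            tp += 1                # correct edit
        else:
            fp += 1                # wrong or spurious edit
            if gold != src:
                fn += 1            # ...that also missed a needed correction
    elif gold != src:
        fn += 1                    # needed correction the model skipped

for s, g, h in [("teh", "the", "the"), ("cat", "cat", "cat")]:
    update(s, g, h)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
```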
1
vote
0
answers
17
views
Why Is My Fine-Tuned RoBERTa (Text classification) Model Only Predicting One Category/Class? [duplicate]
I’m fine-tuning RoBERTa to classify text into 199 categories (e.g., “acculturation stress,” “cognitive flexibility,” etc.). My dataset has ~15,000 lines of text, each mapped to one of these well-being ...
2
votes
2
answers
164
views
Why don't we mask other layers besides the multi-head attention in transformers?
Typically when training for NLP tasks, we need to pad our sequences to a max_len, so they can be processed efficiently in a batch-wise manner. However, these padded ...
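A minimal PyTorch sketch of the usual answer: masking is needed only where pad tokens could influence other positions, i.e. inside attention. Position-wise layers (FFN, layer norm) act on each token independently, and the loss can simply ignore pad positions:

```python
import torch
import torch.nn as nn

batch, max_len, d_model = 2, 5, 16
x = torch.randn(batch, max_len, d_model)
lengths = torch.tensor([5, 3])                                  # 2nd sequence has 2 pads
pad_mask = torch.arange(max_len)[None, :] >= lengths[:, None]   # True at pad positions

attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
out, _ = attn(x, x, x, key_padding_mask=pad_mask)   # pads excluded as attention keys
```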
3
votes
1
answer
106
views
Probability Distribution Underlying Ngram Model
Texts introducing ngram models often directly manipulate conditional probabilities. For example, given a corpus $V$ with a bigram model on its words, we would compute the probability of a sentence $...
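The excerpt's formula is cut off; the standard bigram factorization it is presumably building toward, with MLE estimates from corpus counts, is
$$P(w_1,\dots,w_n)\;\approx\;\prod_{i=1}^{n}P(w_i\mid w_{i-1}),\qquad \hat P(w_i\mid w_{i-1})=\frac{\operatorname{count}(w_{i-1}\,w_i)}{\operatorname{count}(w_{i-1})},$$
with $w_0$ a start-of-sentence symbol; the underlying distribution is thus a first-order Markov chain over the vocabulary.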
1
vote
0
answers
37
views
Poor Stanza performance on Named Entity Identification
I'm using both spaCy and Stanza to identify named entities in very short strings (brand names and business names):
...
0
votes
0
answers
94
views
Regression on SQL query texts. Good ML model architecture
Fast regression on SQL queries. Good ML model architecture.
Our goal is to predict which SQL engine (there are 2 currently) will be faster to execute a given query.
The input is the query text and in ...
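A reasonable baseline before any bespoke architecture (hypothetical data; this frames the task as classifying which engine wins rather than regressing both runtimes):

```python
# Character n-gram TF-IDF over the query text plus a linear classifier.
# SQL keywords, operators, and clause shapes show up well in char n-grams.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

queries = ["SELECT a FROM t WHERE b > 1", "SELECT COUNT(*) FROM t GROUP BY a"]
faster_engine = [0, 1]   # hypothetical labels: 0 = engine A faster, 1 = engine B

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LogisticRegression(max_iter=1000),
)
model.fit(queries, faster_engine)
```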
2
votes
0
answers
40
views
Unsupervised clustering of short texts with covariates
I posted this on the Data Science Stack Exchange and didn't get any responses (that site seems pretty dead). So I'm trying here!
I'm working on a project where I have to categorise short texts. I don'...
1
vote
0
answers
89
views
Measuring Similarity in Embedding Spaces? [closed]
For context, I've been using feature hashing for a rapid text classifier with a very small number of features (2000, it is very small on purpose). I noticed that some of the results were a bit wonky ...
1
vote
1
answer
110
views
NER With Custom Tags, How to Approach
I am building a "field tagger" for documents. Basically, a document, in my case something like a proposal or sales quote, would have a bunch of entities scattered throughout it, and we want ...
4
votes
2
answers
746
views
Why is my randomForest model in R overfitting?
I am trying to train a Random Forest model in R for sentiment analysis. The model works with a tf-idf matrix and learns from it how to classify a review as positive or negative.
Positive ones are ...
0
votes
1
answer
60
views
Find event date given the probabilities of finding an event
I have a set of clinical notes with dates for each patient and an NLP model which gives a score between 0.0 and 1.0 for a certain event being present in the note. Given the scores, what is the best ...
0
votes
1
answer
88
views
Clustering of large text datasets with unknown number of clusters
I have a list of hotel names which may or may not be correct, and with different spellings (such as '&' instead of 'and'). I want to use clustering in order to group the hotels with different ...
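For the "unknown number of clusters" part, one sketch is character n-gram TF-IDF plus agglomerative clustering cut by a distance threshold rather than a fixed cluster count (the threshold below is a made-up starting point; the `metric=` argument requires scikit-learn ≥ 1.2):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

names = ["Grand Hotel & Spa", "Grand Hotel and Spa", "Seaside Inn"]
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(names)

labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.8,   # threshold replaces a fixed k
    metric="cosine", linkage="average",
).fit_predict(X.toarray())                     # needs a dense array
print(labels)                                  # same label => same hotel
```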
1
vote
0
answers
89
views
BERT eval loss increase while performance metrics also increase
I want to fine-tune BERT for Named Entity Recognition (NER). However, when fine-tuning over several epochs on different datasets I get a weird behaviour where the training loss decreases, eval loss ...
2
votes
1
answer
235
views
If a document set is too small for running a topic model, can you simply multiply the document set by a factor of 10 to be able to run the model?
Say I'm using Top2Vec as a topic model to capture the top 10 salient topics across documents. I have an array that contains the documents of the corpus. Initially, there are not enough documents to ...
0
votes
0
answers
411
views
BERT used for generative AI
I have a question about using BERT as a generative model. I know BERT can be used for classification or fine-tuned for question answering. However, is it possible to use BERT to generate ...
1
vote
0
answers
51
views
Encoder-decoder Transformer model predicts outputs almost perfectly but fails to decode autoregressively
The model's sample predictions that I'm printing during training are almost perfect but the model generates meaningless tokens during evaluation.
For training I'm feeding it the source and target ...
0
votes
0
answers
389
views
Regression with text data
My goal is to create a regression model with text data where encoded text predicts a value, (news headlines, or article summaries, predicting number of clicks). The y is very left-skewed (few articles ...
0
votes
0
answers
109
views
Classification in BERT - why not use class as a feature?
I am currently following this post, which details how BERT was trained. I had a few questions about the classification task:
In the post, it mentions that the authors of BERT decided to add ...
2
votes
0
answers
102
views
Can I calculate the significance of the number of deponent verbs with a certain feature like this?
In a language like Ancient Greek, verbal forms are marked for voice (active/middle/passive). Deponent verbs are verbs that exist only in the middle (or passive) voice, but appear to have an active ...
2
votes
1
answer
139
views
Gradient Clipping of Vanilla RNNs vs LSTMs
I am doing an online course that states that the reason we use LSTMs and similar variants of vanilla RNNs is the vanishing/exploding gradient problem of vanilla RNNs.
However, an ...
0
votes
1
answer
125
views
Why is the WordPiece algorithm implemented according to the maximum mutual information?
WordPiece is a subword segmentation algorithm in the field of natural language processing. Unlike BPE, WordPiece selects the pair with the largest mutual information to merge at each step, and ...
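For reference, the merge score usually attributed to WordPiece is
$$\operatorname{score}(x,y)=\frac{\operatorname{count}(xy)}{\operatorname{count}(x)\,\operatorname{count}(y)},$$
a count-level analogue of pointwise mutual information $\log\frac{p(xy)}{p(x)\,p(y)}$: picking the highest-scoring pair merges the two units whose co-occurrence is most surprising under independence, whereas BPE merges the most frequent pair regardless of how frequent its parts are on their own.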
1
vote
0
answers
75
views
Does the skipgram model use backpropagation?
I just started to get interested in natural language processing and I was trying to understand the skipgram model from word2vec. I was reading this interesting website. However, in the mentioned ...
3
votes
3
answers
622
views
Countering t-test "any feature is significant" results for large sample size datasets
I'm doing some analysis over natural language data, which basically entails:
Computing some feature over all samples.
Evaluating if this feature statistically significantly discriminates between ...
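A standard counter to "everything is significant at large n" is to report an effect size alongside the p-value; a minimal Cohen's d sketch on simulated feature values:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.00, 1.0, size=500_000)   # simulated feature values, group 1
b = rng.normal(0.01, 1.0, size=500_000)   # simulated feature values, group 2

pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (a.mean() - b.mean()) / pooled_sd     # t-test is likely "significant",
print(d)                                  # but |d| ~ 0.01 is negligible
```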
0
votes
0
answers
74
views
How is metadata represented in sentiment analysis?
There are papers on semantic analysis using metadata such as "Sentiment Classification on Steam Reviews" (https://cs229.stanford.edu/proj2017/final-reports/5244171.pdf) and "Detecting ...
0
votes
1
answer
1k
views
What is the Llama2 number of steps? [closed]
Llama2 is pretrained on 2 trillion tokens ($2\times10^{12}$), and its batch size is $4\times 10^6$ tokens.
We can calculate the number of steps (times we update the parameters) per epoch as follows:
$$\...
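The equation is cut off in the excerpt; with the corrected token count, the arithmetic it sets up is presumably
$$\text{steps per epoch}=\frac{\text{training tokens}}{\text{tokens per batch}}=\frac{2\times10^{12}}{4\times10^{6}}=5\times10^{5}.$$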
2
votes
0
answers
80
views
Does it make sense to perform Domain Adaptation before Transfer Learning?
Suppose I would like to do extractive question answering on scientific literature. I'm interested in using BERT, which was pretrained on Wikipedia and BookCorpus. I see two routes here:
1. Fine-tune BERT on ...
0
votes
1
answer
81
views
Considering weights right of the embeddings layer aren't used in Doc2Vec/Word2Vec, is the informative capacity of the embeddings not strongly reduced?
In an extreme (and probably impossible) example, could you not end up with all the power for the prediction being contained in the weights to the right of the embeddings layer?...and thus the matrix ...
1
vote
1
answer
183
views
How are vector values assigned initially in Word2Vec and how are they changed with iterations of the algorithm?
I am new to NLP and I'm not fully grasping how word2vec works. I understand that it aims to predict a word given its context or a context given a word but I'm not sure how the initial vector values ...
0
votes
1
answer
688
views
Using Word Embeddings in Clustering and Topic Modelling
I am new to the field of NLP and would appreciate any guidance please. I am trying to understand how word embeddings can be used in clustering and topic modelling. If I create word embeddings for ...
1
vote
1
answer
302
views
How does training word embeddings bring similar words closer together?
How does training of word embeddings lead to the clustering of similar words in the embedding space? What causes that effect?
1
vote
0
answers
277
views
How to determine EC2 instance type and memory for LLM inference endpoint [closed]
I am trying to estimate the costs required for hosting a fine-tuned large language model for real-time inference. There will be hundreds of users querying the endpoint concurrently for multiple use cases ...
1
vote
0
answers
63
views
Machine learning and Natural Language Processing Algorithms for Indian Surnames Homophones [closed]
Homophones
Indian Surnames List
English last names
Can machine learning, natural language processing (NLP), and artificial intelligence assist in classifying, interpreting, and specifying the differences ...
0
votes
1
answer
75
views
Creating a morphology tagging/labeling model
I had an idea of building a model using machine learning or deep learning in order to perform morphological tagging/labeling on untagged/unlabeled data.
I have a lot of tagged/labeled data (about 30,...
1
vote
2
answers
263
views
Is concatenating a single integer sufficient for encoding positional embeddings in transformer models?
In transformer models, positional embeddings are commonly used to encode the positional information of words in a sequence. While sinusoidal positional embeddings are often employed, I'm curious about ...
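For comparison with the single-integer idea, a minimal sketch of the standard sinusoidal encoding: position is spread across many bounded dimensions, so no coordinate grows with sequence length the way a raw concatenated integer would:

```python
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal positional encoding (d_model must be even)."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angles = pos / np.power(10_000, i / d_model)       # geometric frequency ladder
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(max_len=128, d_model=64)            # all values stay in [-1, 1]
```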
1
vote
1
answer
133
views
Input non-sequential data of arbitrary size to network
I have a case where I want to feed a network with polylines of data. The problem is that the input can be any number of polylines and the polylines can consist of any number of points. If we instead ...
3
votes
0
answers
533
views
Attention is All You Need: How to calculate params number of the models?
I want to re-calculate the last column of Table 3 of Attention is All You Need, i.e. the number of params in the models, but the numbers from my calculation do not match.
Model | Params from Table 3 ($\times 10^...
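A rough back-of-the-envelope for the base model (my own sketch: d_model = 512, d_ff = 2048, 6+6 layers, shared embeddings over a vocabulary on the order of 37k BPE types; biases and layer norms ignored):

```python
d, d_ff, N, vocab = 512, 2048, 6, 37_000

attn = 4 * d * d                 # W_q, W_k, W_v, W_o projections
ffn = 2 * d * d_ff               # two position-wise linear layers
enc = N * (attn + ffn)           # encoder stack
dec = N * (2 * attn + ffn)       # decoder adds a cross-attention block
emb = vocab * d                  # shared input/output embedding matrix
print((enc + dec + emb) / 1e6)   # ~63M, close to the paper's ~65M for "base"
```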
1
vote
0
answers
222
views
Sentiment Analysis with Continuous Output Labels
Problem Setting/Context:
I have feedback (each piece of feedback has multiple sentences) associated with different products (you can safely assume that each piece of feedback talks about one single product), and I need to ...
1
vote
0
answers
41
views
Accounting for edge cases without training on the test set
I'm fine-tuning a large language model to predict binary sentiment, where a false negative is far more costly for my use case than a false positive. I've used weighted cross-entropy to account for ...
1
vote
2
answers
540
views
Why does the Transformer model not require negative sampling while word2vec does?
Both word2vec and the Transformer model compute a softmax over the words/tokens on the output side.
For word2vec models, negative sampling is used for computational reasons:
Is negative sampling ...
2
votes
0
answers
123
views
Why does the best performing adapter-based parameter-efficient fine-tuning depend on the language model being fine-tuned?
https://arxiv.org/abs/2304.01933 shows that the best performing adapter-based parameter-efficient fine-tuning depends on the language model being fine-tuned:
E.g., LoRA is the best adapter for LLaMA-...
1
vote
1
answer
57
views
Best Way to do Hyperparam Search and Cross-Validation
I'm running experiments to evaluate language models on Brazilian Portuguese datasets.
I've divided each dataset into 10 parts, and I want to use cross-validation to determine the model's ...
0
votes
1
answer
289
views
Estimating exponent of Zipf distribution using MLE vs fitting linear regression on log-transformed rank and frequency data
I'm having trouble understanding why I get radically different results if I try to find the parameter of a Zipf distribution when I use the methods proposed by Clauset et al. (2009) as opposed to ...
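A sketch contrasting the two estimators on the same rank-frequency data (the counts below are made up): OLS on the log-log plot weights every rank equally, while an MLE in the spirit of Clauset et al. maximizes the likelihood of the observed tokens, so the two routinely disagree:

```python
import numpy as np
from scipy.optimize import minimize_scalar

freqs = np.array([1000, 480, 310, 220, 90, 40, 12, 5])   # made-up word counts
ranks = np.arange(1, len(freqs) + 1)

# 1) OLS slope on log-log rank-frequency: frequency ~ rank^(-s)
s_ols = -np.polyfit(np.log(ranks), np.log(freqs), 1)[0]

# 2) MLE for a Zipf law over the observed ranks: P(r) = r^(-s) / Z(s)
def neg_loglik(s):
    logZ = np.log(np.sum(ranks ** (-s)))          # finite-support normalizer
    return -np.sum(freqs * (-s * np.log(ranks) - logZ))

s_mle = minimize_scalar(neg_loglik, bounds=(0.1, 10), method="bounded").x
print(s_ols, s_mle)   # typically noticeably different on real corpora
```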