RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! when resuming training

Question

I saved a checkpoint while training on a GPU. After reloading the checkpoint and continuing training, I get the following error:

Traceback (most recent call last):
  File "main.py", line 140, in <module>
    train(model,optimizer,train_loader,val_loader,criteria=args.criterion,epoch=epoch,batch=batch)
  File "main.py", line 71, in train
    optimizer.step()
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/optim/sgd.py", line 106, in step
    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

My training code is as follows:

def train(model,optimizer,train_loader,val_loader,criteria,epoch=0,batch=0):
    batch_count = batch
    if criteria == 'l1':
        criterion = L1_imp_Loss()
    elif criteria == 'l2':
        criterion = L2_imp_Loss()
    if args.gpu and torch.cuda.is_available():
        model.cuda()
        criterion = criterion.cuda()

    print(f'{datetime.datetime.now().time().replace(microsecond=0)} Starting to train..')
    
    while epoch <= args.epochs-1:
        print(f'********{datetime.datetime.now().time().replace(microsecond=0)} Epoch#: {epoch+1} / {args.epochs}')
        model.train()
        interval_loss, total_loss= 0,0
        for i , (input,target) in enumerate(train_loader):
            batch_count += 1
            if args.gpu and torch.cuda.is_available():
                input, target = input.cuda(), target.cuda()
            input, target = input.float(), target.float()
            pred = model(input)
            loss = criterion(pred,target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            ....

The saving process happened after finishing each epoch:

torch.save({
    'epoch': epoch,
    'batch': batch_count,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': total_loss / len(train_loader),
    'train_set': args.train_set,
    'val_set': args.val_set,
    'args': args,
}, f'{args.weights_dir}/FastDepth_Final.pth')

I can't figure out why I get this error. args.gpu == True, and I'm moving the model, all the data, and the loss function to CUDA, but somehow there is still a tensor on the CPU. Could anyone figure out what's wrong?

Thanks.

It seems like the issue comes from criterion(pred, target). Can you check pred.is_cuda and target.is_cuda?
– Ivan, Commented Feb 7, 2021 at 19:49
It looks like you are calling .cuda() on your model too late: it needs to be called BEFORE you initialise the optimiser. From the docs: "If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call. In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used."
– UpstatePedro, Commented Aug 10, 2021 at 14:40
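To check where each tensor actually lives, in the spirit of the comments above, it can help to list the devices of the model's parameters and of the optimizer's state side by side. The `report_devices` helper below is hypothetical (not a PyTorch API), and the small `nn.Linear` model stands in for the real network:

```python
import torch
import torch.nn as nn


def report_devices(model, optimizer):
    """Collect the devices used by the model and by the optimizer's state tensors."""
    model_tensors = list(model.parameters()) + list(model.buffers())
    model_devices = {str(t.device) for t in model_tensors}
    optimizer_devices = {
        str(value.device)
        for state in optimizer.state.values()
        for value in state.values()
        if torch.is_tensor(value)
    }
    return {"model": model_devices, "optimizer_state": optimizer_devices}


# Stand-in model and optimizer, both created on the CPU.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One training step so that the momentum buffers exist.
model(torch.randn(3, 4)).sum().backward()
optimizer.step()

print(report_devices(model, optimizer))
```

If the two sets differ (for example the model reports a CUDA device while the optimizer state reports the CPU), the mismatch is in the optimizer state rather than in the inputs.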

Shai · Accepted Answer · 2021-02-08 06:14:26Z

53

There might be an issue with the device the parameters are on:

If you need to move a model to GPU via .cuda() , please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call.
In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.
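Applied to resuming from a checkpoint, this suggests an order of operations: move the model to the target device, then construct the optimizer, then load the optimizer state. The momentum buffer in the question's traceback is optimizer state, so it has to end up on the same device as the gradients. Below is a minimal sketch, assuming the checkpoint keys from the question (`model_state_dict`, `optimizer_state_dict`); the `build_model` helper, the `nn.Linear` model and the in-memory buffer are hypothetical stand-ins for the real network and the `.pth` file:

```python
import io

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def build_model():
    # Hypothetical stand-in for the real network.
    return nn.Linear(4, 2)


# --- Simulate an earlier run that saved a checkpoint ---
model = build_model().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
model(torch.randn(8, 4, device=device)).sum().backward()
optimizer.step()  # creates the momentum buffers

checkpoint_file = io.BytesIO()  # in-memory stand-in for the .pth file
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    checkpoint_file,
)
checkpoint_file.seek(0)

# --- Resume ---
checkpoint = torch.load(checkpoint_file, map_location=device)

resumed_model = build_model()
resumed_model.load_state_dict(checkpoint["model_state_dict"])
resumed_model.to(device)  # 1. move the model first

resumed_optimizer = torch.optim.SGD(  # 2. then construct the optimizer
    resumed_model.parameters(), lr=0.1, momentum=0.9
)
resumed_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])  # 3. then load its state

# One more training step to confirm that everything is on the same device.
resumed_optimizer.zero_grad()
resumed_model(torch.randn(8, 4, device=device)).sum().backward()
resumed_optimizer.step()
```

In the question's code, `model.cuda()` is called inside `train()`, which runs after the optimizer has been created and restored, so moving that call up to where the model is built would match this order.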

answered Feb 8, 2021 at 6:14

Shai


1 Comment

Nir Over a year ago

adding .cuda() to the input data solved it for me: pred = model(x.cuda())

mirekphd · Accepted Answer · 2025-01-04 07:54:28Z

44

Make sure to add .to(device) [1] consistently to both the model and its inputs.

[1] where device="cpu" or device="cuda" (equivalent of .cuda())
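As a rough illustration of what "consistently" means here, the sketch below picks one `device` and uses it for the model and for every batch. The `nn.Linear` model, the `nn.L1Loss` criterion and the random tensors are stand-ins chosen for the example, not the question's actual code:

```python
import torch
import torch.nn as nn

# Pick the device once and reuse it everywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(3, 1).to(device)  # stand-in model, moved before the optimizer is built
criterion = nn.L1Loss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Batches coming out of a DataLoader normally live on the CPU.
inputs = torch.randn(5, 3)
targets = torch.randn(5, 1)

# Tensor.to() returns a new tensor, so the result has to be reassigned.
inputs, targets = inputs.to(device), targets.to(device)

loss = criterion(model(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Note the asymmetry: `model.to(device)` moves a module's parameters in place, whereas `tensor.to(device)` returns a copy and leaves the original tensor where it was.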

edited Jan 4 at 7:54

mirekphd


answered Sep 22, 2022 at 9:54

Shirley Ow


1 Comment

alchemy Over a year ago

model = model.to(device)

ricksanchezdev · Accepted Answer · 2022-12-13 16:43:55Z

15

For me, it worked after adding

model.to('cuda')

right after setting my model up:

class Agent:
    def __init__(self):
        self.n_game = 0
        self.epsilon = 0  # Randomness
        self.gamma = 0.9  # discount rate
        self.memory = deque(maxlen=MAX_MEMORY)  # popleft()
        self.model = Linear_QNet(11, 256, 3)  # here
        self.model.to('cuda')  # and here
        self.trainer = QTrainer(self.model, lr=LR, gamma=self.gamma)
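If the optimizer was already constructed (or its state already loaded) before the model was moved, one possible workaround is to move the optimizer's state tensors afterwards. `optimizer_to` below is a hypothetical helper, not a PyTorch API. It is a sketch written with the SGD momentum buffers from the question's traceback in mind; other optimizers may keep state with different device expectations, so treat it as an assumption to verify for your optimizer:

```python
import torch
import torch.nn as nn


def optimizer_to(optimizer, device):
    """Move every tensor held in the optimizer's per-parameter state to `device`."""
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2)  # stand-in model, built on the CPU
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
model(torch.randn(3, 4)).sum().backward()
optimizer.step()  # momentum buffers are created on the CPU

model.to(device)  # parameters move, the optimizer state does not
optimizer_to(optimizer, device)  # bring the state along

optimizer.zero_grad()
model(torch.randn(3, 4, device=device)).sum().backward()
optimizer.step()
```

Moving the model before building the optimizer, as in the answer above, avoids the need for such a helper in the first place.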

answered Dec 13, 2022 at 16:43

ricksanchezdev


Aayush Shah · Accepted Answer · 2023-03-24 09:18:32Z

15

If, like me, you are still facing this issue, it might be related to the tokenizer. You're moving the model to the GPU but not the tokenized ids!

So, make sure you go by this:

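import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: fall back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
model.to(device)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

# `prompt` is your input string; move the tokenized ids to the model's device.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)  # This line.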

Then you can safely make the inference from the model! 🎉
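A tokenizer usually returns more than one tensor (for example an attention mask next to the input ids), and every tensor passed to the model has to be on the model's device. The sketch below uses a plain dict of tensors as a stand-in for the tokenizer output so that it runs without downloading a model; `move_to_device` is a hypothetical helper, not part of any library:

```python
import torch


def move_to_device(batch, device):
    """Recursively move tensors inside dicts, lists and tuples to `device`."""
    if torch.is_tensor(batch):
        return batch.to(device)
    if isinstance(batch, dict):
        return {key: move_to_device(value, device) for key, value in batch.items()}
    if isinstance(batch, list):
        return [move_to_device(value, device) for value in batch]
    if isinstance(batch, tuple):
        return tuple(move_to_device(value, device) for value in batch)
    return batch  # leave non-tensor values untouched


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for a tokenizer output with return_tensors="pt".
encoded = {
    "input_ids": torch.tensor([[101, 2023, 102]]),
    "attention_mask": torch.tensor([[1, 1, 1]]),
}

encoded = move_to_device(encoded, device)
print({name: tensor.device for name, tensor in encoded.items()})
```

With Hugging Face tokenizers, the returned object also has its own `.to(device)` method, so `tokenizer(prompt, return_tensors="pt").to(device)` should move all of its tensors at once.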

edited Mar 24, 2023 at 9:18

answered Mar 24, 2023 at 8:45

Aayush Shah


amin jahani · Accepted Answer · 2022-11-14 05:14:09Z

4

Adding the two lines below resolved the issue for me on Colab (add them in both the saving and the loading code):

device = torch.device("cuda")
model.cuda()

Note: if you are using Google Colab, you should also set your Colab runtime to GPU.

edited Nov 14, 2022 at 5:14

answered Nov 14, 2022 at 5:08

amin jahani


Aleesha s j · Accepted Answer · 2022-09-12 17:31:23Z

3

I added the code below at the start of the file, and it solved my issue:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

answered Sep 12, 2022 at 17:31

Aleesha s j


PaulMest · Accepted Answer · 2022-11-21 04:11:08Z

3

I'm going through the Fast AI 2022 course and trying to use my M1 Max. I've found that at least with some of the Fastbook code, I could set default_device(torch.device("mps")) and it would resolve my problems.

Here is a reusable snippet that I put at the top of the Jupyter Notebooks I've been dabbling in:

# Check that MPS is available
if not torch.backends.mps.is_available():
    if not torch.backends.mps.is_built():
        print("MPS not available because the current PyTorch install was not "
              "built with MPS enabled.")
    else:
        print("MPS not available because the current MacOS version is not 12.3+ "
              "and/or you do not have an MPS-enabled device on this machine.")

else:
    print("MPS is available. Setting as default device.")
    mps_device = torch.device("mps")
    default_device(mps_device)
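Outside of fastai, a similar effect can be had in plain PyTorch by choosing the device once and passing it around explicitly. `pick_device` below is a hypothetical helper, and the preference order (CUDA, then MPS, then CPU) is an assumption made for this sketch:

```python
import torch


def pick_device():
    """Return the preferred available device: CUDA, then MPS, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    # torch.backends.mps is missing in older PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()
x = torch.ones(2, 2, device=device)  # create tensors directly on the chosen device
print(device, x.device)
```

Creating tensors with `device=device` (or calling `.to(device)` on them) keeps the model and its inputs on the same backend, whichever one is selected.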

answered Nov 21, 2022 at 4:11

PaulMest


2 Comments

Erik B Over a year ago

Works for me too, on a MacBook M1, at least for the first chapter. Are there places where it doesn't work for you @paulmest?

PaulMest Over a year ago

@ErikB yes, lots of places. I decided just to pay $10/month to get a Paperspace account. I was spending too much time yak-shaving on getting it to run on the M1. There are about 40 operations that are unsupported in PyTorch + MPS: github.com/pytorch/pytorch/issues/77764. So you're bound to hit one of them eventually.

Abdulmuiz Shaikh · Accepted Answer · 2024-05-14 10:25:17Z

1

For me, moving the tokenized data to the GPU with .to("cuda") worked.

answered May 14, 2024 at 10:25

Abdulmuiz Shaikh


Genius Mouse · Accepted Answer · 2022-11-01 19:03:50Z

0

This answer by Shirley Ow helped me: make sure to add .to(device) to both the model and the model inputs.

img = torch.from_numpy(img).to(device) # Code in yolov7

answered Nov 1, 2022 at 19:03

Genius Mouse


li2 · Accepted Answer · 2023-01-31 05:54:55Z

0

I think that after you load the model, it is no longer on the GPU. Try:

model = AutoModelForSequenceClassification.from_pretrained(output_dir).to(device)

answered Jan 31, 2023 at 5:54

li2


fmatt · Accepted Answer · 2023-03-05 21:37:36Z

0

This is not the case for this question, but for those who are confused by this error like I was: I hadn't moved the pos_weight argument of BCEWithLogitsLoss to the device. Changing

criterion = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([3]))

to

criterion = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([3]).to(device))

fixed the problem.
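An alternative that should have the same effect is to move the whole criterion, since `BCEWithLogitsLoss` keeps `pos_weight` as a module buffer and `.to(device)` moves buffers along with the module. A minimal sketch, where the random logits and all-ones targets are stand-ins for real data:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# pos_weight is created on the CPU, then moved together with the criterion.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0])).to(device)

logits = torch.randn(4, 1, device=device)
targets = torch.ones(4, 1, device=device)

loss = criterion(logits, targets)
print(criterion.pos_weight.device, loss.device)
```

This mirrors the `criterion = criterion.cuda()` call in the question's training code, which moves any tensors registered on the loss module.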

answered Mar 5, 2023 at 21:37

fmatt
