Is Torch limited to running on CUDA, or does it also support NPU architectures such as the Ascend MDC910B? Also, can large-model quantization be spread across multiple GPUs when the model's runtime memory exceeds what a single 80GB GPU can hold?