I am currently working on distributed inference on TPUs. I created a fresh TPU v2-8 VM and tried to run my PyTorch/XLA code on it.
VM creation command:
gcloud compute tpus tpu-vm create <vm-name> \
    --project <project-id> \
    --zone=us-central1-f \
    --accelerator-type=v2-8 \
    --version=tpu-ubuntu2204-base \
    --data-disk source=<disk-config>
When I run the code, each spawned process aborts with SIGABRT and logs "Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8477 in any of the 4 ports provided in..."
The entire stack trace is given below:
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
https://symbolize.stripped_domain/r/?trace=7fb561a969fc,7fb561a4251f&map=
*** SIGABRT received by PID 293850 (TID 293850) on cpu 35 from PID 293850; stack trace: ***
PC: @ 0x7fb561a969fc (unknown) pthread_kill
@ 0x7fb4045aa53a 1152 (unknown)
@ 0x7fb561a42520 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7fb561a969fc,7fb4045aa539,7fb561a4251f&map=abbd016d9542b8098892badc0b19ea68:7fb3f7400000-7fb4047becf0
E0119 08:55:08.340662 293850 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked.
E0119 08:55:08.340686 293850 coredump_hook.cc:486] RAW: Skipping coredump since rlimit was 0 at process start.
E0119 08:55:08.340695 293850 client.cc:272] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0119 08:55:08.340702 293850 coredump_hook.cc:542] RAW: Sending fingerprint to remote end.
E0119 08:55:08.340735 293850 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0119 08:55:08.340758 293850 coredump_hook.cc:603] RAW: Dumping core locally.
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1705654508.342848 293845 common_lib.cc:822] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8477 in any of the 4 ports provided in `tpu_process_addresses`=localhost:8476,localhost:8477,localhost:8478,localhost:8479.
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:539
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1705654508.361102 293849 common_lib.cc:822] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8478 in any of the 4 ports provided in `tpu_process_addresses`=localhost:8476,localhost:8477,localhost:8478,localhost:8479.
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:539
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1705654508.408523 293848 common_lib.cc:822] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8479 in any of the 4 ports provided in `tpu_process_addresses`=localhost:8476,localhost:8477,localhost:8478,localhost:8479.
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:539
https://symbolize.stripped_domain/r/?trace=7f31902969fc,7f319024251f&map=
*** SIGABRT received by PID 293845 (TID 293845) on cpu 56 from PID 293845; stack trace: ***
PC: @ 0x7f31902969fc (unknown) pthread_kill
@ 0x7f3032faa53a 1152 (unknown)
@ 0x7f3190242520 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f31902969fc,7f3032faa539,7f319024251f&map=abbd016d9542b8098892badc0b19ea68:7f3025e00000-7f30331becf0
E0119 08:55:08.431729 293845 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked.
E0119 08:55:08.431761 293845 coredump_hook.cc:486] RAW: Skipping coredump since rlimit was 0 at process start.
E0119 08:55:08.431793 293845 client.cc:272] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0119 08:55:08.431808 293845 coredump_hook.cc:542] RAW: Sending fingerprint to remote end.
E0119 08:55:08.431843 293845 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0119 08:55:08.431855 293845 coredump_hook.cc:603] RAW: Dumping core locally.
https://symbolize.stripped_domain/r/?trace=7f611b8969fc,7f611b84251f&map=
*** SIGABRT received by PID 293849 (TID 293849) on cpu 21 from PID 293849; stack trace: ***
PC: @ 0x7f611b8969fc (unknown) pthread_kill
@ 0x7f5fbe3aa53a 1152 (unknown)
@ 0x7f611b842520 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f611b8969fc,7f5fbe3aa539,7f611b84251f&map=abbd016d9542b8098892badc0b19ea68:7f5fb1200000-7f5fbe5becf0
E0119 08:55:08.449987 293849 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked.
E0119 08:55:08.450014 293849 coredump_hook.cc:486] RAW: Skipping coredump since rlimit was 0 at process start.
E0119 08:55:08.450029 293849 client.cc:272] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0119 08:55:08.450041 293849 coredump_hook.cc:542] RAW: Sending fingerprint to remote end.
E0119 08:55:08.450085 293849 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0119 08:55:08.450100 293849 coredump_hook.cc:603] RAW: Dumping core locally.
https://symbolize.stripped_domain/r/?trace=7faeabc969fc,7faeabc4251f&map=
*** SIGABRT received by PID 293848 (TID 293848) on cpu 21 from PID 293848; stack trace: ***
PC: @ 0x7faeabc969fc (unknown) pthread_kill
@ 0x7fad4e7aa53a 1152 (unknown)
@ 0x7faeabc42520 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7faeabc969fc,7fad4e7aa539,7faeabc4251f&map=abbd016d9542b8098892badc0b19ea68:7fad41600000-7fad4e9becf0
E0119 08:55:08.498424 293848 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked.
E0119 08:55:08.498443 293848 coredump_hook.cc:486] RAW: Skipping coredump since rlimit was 0 at process start.
E0119 08:55:08.498452 293848 client.cc:272] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0119 08:55:08.498460 293848 coredump_hook.cc:542] RAW: Sending fingerprint to remote end.
E0119 08:55:08.498490 293848 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0119 08:55:08.498510 293848 coredump_hook.cc:603] RAW: Dumping core locally.
E0119 08:55:08.597408 293850 process_state.cc:783] RAW: Raising signal 6 with default behavior
E0119 08:55:08.693804 293845 process_state.cc:783] RAW: Raising signal 6 with default behavior
E0119 08:55:08.711156 293849 process_state.cc:783] RAW: Raising signal 6 with default behavior
E0119 08:55:08.749858 293848 process_state.cc:783] RAW: Raising signal 6 with default behavior
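A quick way to sanity-check the error text is to parse one of the lines above and compare the port it complains about with the ports it lists. The sketch below does that for the line logged by process 293845. The helper name `parse_port_error` is hypothetical, and the parsing assumes the message format seen in this log rather than any documented format.

```python
import re

# Error line copied from the log above (process 293845).
LOG_LINE = (
    "Could not set metric server port: INVALID_ARGUMENT: Could not find "
    "SliceBuilder port 8477 in any of the 4 ports provided in "
    "`tpu_process_addresses`=localhost:8476,localhost:8477,"
    "localhost:8478,localhost:8479."
)


def parse_port_error(line):
    """Return (requested_port, provided_ports) parsed from the error line.

    Hypothetical helper: it relies on the wording seen in this log only.
    """
    # Port the runtime says it could not find.
    requested = int(re.search(r"SliceBuilder port (\d+)", line).group(1))
    # Everything after the `tpu_process_addresses`= marker is the address list.
    tail = line.split("`tpu_process_addresses`=", 1)[1]
    provided = [int(port) for port in re.findall(r":(\d+)", tail)]
    return requested, provided


if __name__ == "__main__":
    requested, provided = parse_port_error(LOG_LINE)
    print("requested port:", requested)
    print("provided ports:", provided)
    print("requested port is in the provided list:", requested in provided)
```

For this line the requested port 8477 does appear among the listed addresses, which suggests the message by itself may not explain the SIGABRT.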
Even a minimal script like the one below does not work:
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Per-process entry point; intentionally empty for this repro.
    pass

if __name__ == "__main__":
    xmp.spawn(
        _mp_fn,
        start_method='spawn'
    )
How should I fix this?