I have reported this issue on TensorBoard's GitHub.
Here is a repository with a full reproduction of my issue: https://github.com/AlonKellner/s3-tensorboard-issue-reproduction
Environment information
Diagnostics
Diagnostics output
--- check: autoidentify
INFO: diagnose_tensorboard.py version df7af2c6fc0e4c4a5b47aeae078bc7ad95777ffa
--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=10, micro=13, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='c5ff1db54ce4', release='5.15.133.1-microsoft-standard-WSL2', version='#1 SMP Thu Oct 5 21:02:42 UTC 2023', machine='x86_64')
INFO: sys.getwindowsversion(): N/A
--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None
--- check: installed_packages
INFO: installed: tensorboard==2.15.1
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
WARNING: no installation among: ['tensorflow-estimator', 'tensorflow-estimator-2.0-preview', 'tf-estimator-nightly']
INFO: installed: tensorboard-data-server==0.7.2
--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.15.1'
--- check: tensorflow_python_version
Traceback (most recent call last):
  File "//diagnose_tensorboard.py", line 511, in main
    suggestions.extend(check())
  File "//diagnose_tensorboard.py", line 81, in wrapper
    result = fn()
  File "//diagnose_tensorboard.py", line 267, in tensorflow_python_version
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'
--- check: tensorboard_data_server_version
INFO: data server binary: '/usr/local/lib/python3.10/site-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.7.2'
--- check: tensorboard_binary_path
INFO: which tensorboard: b'/usr/local/bin/tensorboard\n'
--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]
--- check: readable_fqdn
INFO: socket.getfqdn(): 'c5ff1db54ce4'
--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: .tensorboard-info directory does not exist
--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/usr/local/lib/python3.10/site-packages']; bad_roots (0): []
--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==2.0.0
aiobotocore==2.9.0
aiohttp==3.9.1
aioitertools==0.11.0
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.1.0
botocore==1.33.13
cachetools==5.3.2
certifi==2023.11.17
charset-normalizer==3.3.2
filelock==3.13.1
frozenlist==1.4.1
fsspec==2023.12.2
google-auth==2.25.2
google-auth-oauthlib==1.2.0
grpcio==1.60.0
idna==3.6
Jinja2==3.1.2
jmespath==1.0.1
lightning==2.1.3
lightning-utilities==0.10.0
Markdown==3.5.1
MarkupSafe==2.1.3
mpmath==1.3.0
multidict==6.0.4
networkx==3.2.1
numpy==1.26.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
packaging==23.2
pip==23.0.1
protobuf==4.23.4
pyasn1==0.5.1
pyasn1-modules==0.3.0
python-dateutil==2.8.2
pytorch-lightning==2.1.3
PyYAML==6.0.1
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
s3fs==2023.12.2
setuptools==65.5.1
six==1.16.0
sympy==1.12
tensorboard==2.15.1
tensorboard-data-server==0.7.2
tensorflow-io==0.35.0
tensorflow-io-gcs-filesystem==0.35.0
torch==2.1.2
torchmetrics==1.2.1
tqdm==4.66.1
triton==2.1.0
typing_extensions==4.9.0
urllib3==2.0.7
Werkzeug==3.0.1
wheel==0.42.0
wrapt==1.16.0
yarl==1.9.4
Issue description
I am trying to find a free, lightweight, on-prem alternative to wandb/comet-ml/mlflow. A promising direction is using TensorBoard with S3-compatible storage.
However, TensorBoard behaves unexpectedly when pointed at S3-compatible storage: only the first experiments that the server comes across are shown in the UI.
All experiments that are present at startup are shown fully. If no experiment is present at startup, the first detected experiment is shown only partially.
After an experiment is detected and shown, no further steps or experiments are reloaded and shown.
When using the --reload_task process option, no experiment is shown at all.
I have reproduced this unexpected behavior with both Ceph (an on-prem instance) and MinIO (a local Docker image; see the reproduction repo).
The expected behavior is that any new experiment written to the S3-compatible storage is loaded into the UI when pressing the reload button, along with any new steps in that experiment.
I also expect this to work correctly with the --reload_multifile=true option.
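To make the three configurations above concrete, here is a minimal sketch that only builds and prints the corresponding command lines. The bucket path `s3://my-bucket/tb-logs` and the helper `tensorboard_argv` are hypothetical names used for illustration. The flags are `--logdir` plus the two options named in this post. Endpoint and credential setup for the S3-compatible store is deliberately left out, since it depends on the environment.

```python
import shlex


def tensorboard_argv(logdir, reload_task=None, reload_multifile=None):
    """Build a `tensorboard` command line as a list of arguments.

    Hypothetical helper for illustration; it does not start TensorBoard.
    """
    argv = ["tensorboard", f"--logdir={logdir}"]
    if reload_task is not None:
        # Option mentioned in the post: with "process", no experiment is shown.
        argv.append(f"--reload_task={reload_task}")
    if reload_multifile is not None:
        # Option mentioned in the post, written as true/false.
        argv.append(f"--reload_multifile={'true' if reload_multifile else 'false'}")
    return argv


# Assumption: a placeholder bucket path, not the one from the reproduction repo.
LOGDIR = "s3://my-bucket/tb-logs"

variants = {
    "default": tensorboard_argv(LOGDIR),
    "reload_task=process": tensorboard_argv(LOGDIR, reload_task="process"),
    "reload_multifile=true": tensorboard_argv(LOGDIR, reload_multifile=True),
}

for label, argv in variants.items():
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(f"{label}: {shlex.join(argv)}")
```

Running each printed command against the same bucket while a training job writes new steps should be enough to compare the three behaviors side by side.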
Workarounds are also welcome, thanks :)