Upload of python environment to OD fails without error message

Dear community,
currently I’m trying to upload a custom Python environment to ONE DATA to use it in the Python execution context in Functions.
I followed this guide Environments for Python Execution Context : Service Desk & Manuals to create the .tar file, and the Docker image seems to work correctly but has a size of 3.14GB.
When I upload the environment to OD, the loading indicator appears but vanishes after ~30min without telling me whether the upload succeeded or failed. Afterwards I can’t select my environment, so I assume it has failed.
Are there any restrictions on how big a Python environment is allowed to be, and are there any ways to better track the upload status and get more insight into it? Are there maybe any restrictions on the environment that are not mentioned in the guide?

Thank you!

Hey Lucas,

there are multiple possible limitations in play here. In general, an upload goes through this route:

ingress -> reverse-proxy -> onedata-server -> faas-server -> function-registry

As for which ones could interfere here:


First likely one is that the registry is set to only have 10Gi persistent volume by default.

This is set by the helm chart, in the function registry section:

functionRegistry:
  persistence:
    size: 10Gi

For a Docker-based environment, I’m not quite sure whether any limit is set.

Generally speaking, each function takes up about 65-70M; that should give you an estimate of how much free space is left if you cannot check the volume’s space directly in your environment.
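To make that concrete, here is a quick back-of-the-envelope estimate (the number of existing functions is a made-up assumption, not a value from any real instance):

```python
# Rough capacity estimate for a 10Gi registry volume, using the
# ~65-70M-per-function figure above (illustrative numbers only)
volume_gib = 10
per_function_mib = 70          # upper bound per regular function
n_functions = 20               # hypothetical count of already-deployed functions
used_gib = n_functions * per_function_mib / 1024
free_gib = volume_gib - used_gib
print(f"~{free_gib:.1f} GiB left for the new environment")  # ~8.6 GiB
```

So even with a couple dozen regular functions deployed, a 3.14GB image should still fit into the default volume, which is why the other limits below are worth checking too.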


The second limitation could be in the overall ingress using NGINX.

This is set manually per environment; from the looks of it, the Hyper Hyper template uses 5G here, but I can see some environments using 10G. Additionally, the ingress has a time limit for any transfer as well, which seems to be set to anywhere between 8.5m and 2h. It is likely that you run out of time here if it is set to something like 30m.

The way these are set is by the ingress annotations in the helm chart:

ingress:
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: 5G
    nginx.ingress.kubernetes.io/proxy-read-timeout: 3600
    nginx.ingress.kubernetes.io/proxy-write-timeout: 510

Again, for a Docker-based environment I am not quite sure how it is done, but it should be similar with regard to the body size and timeouts; in that case the keys are client_max_body_size, proxy_send_timeout, and proxy_read_timeout.
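Those keys would sit in a plain NGINX configuration roughly like this (a sketch with illustrative values, not a verified OD config):

```nginx
# Illustrative values only - size them to your largest expected upload
http {
    server {
        # Allow request bodies up to 10G (the NGINX default is only 1M)
        client_max_body_size 10G;

        # Give the upstream up to an hour to receive and answer the request
        proxy_send_timeout 3600s;
        proxy_read_timeout 3600s;
    }
}
```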


I am not quite sure whether the third one interferes, but our reverse proxy solution with Traefik may close the connection as well.

In this case, the config map reverse-proxy-cm holds the configuration, and for each middleware (faas-server-mw, function-registry-mw) extra options can be added. While nothing is set here by default, buffering.maxRequestBodyBytes would control the size of the request, like:

http:
  middlewares:
    faas-server-mw:
      buffering:
        # Set 4GiB for the Request max size
        maxRequestBodyBytes: 4295000000

When used, Traefik will try to buffer the request in memory if there is available space; if not, it will use the filesystem to forward the data. This can add additional time to the whole process, as filesystems are generally slower.

The timeout here can be set via the forwarding timeouts, but those do not seem to be set by default.
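For reference, in Traefik v2 the forwarding timeouts live on a serversTransport in the dynamic configuration; a sketch of what that could look like (the transport name is hypothetical, and it would still need to be attached to the affected service):

```yaml
http:
  serversTransports:
    # Hypothetical transport name - reference it from the service in use
    long-upload-transport:
      forwardingTimeouts:
        # How long to wait for the backend's response headers; 0 disables the limit
        responseHeaderTimeout: 2h
        # How long an idle keep-alive connection to the backend may stay open
        idleConnTimeout: 90s
```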


The fourth one is hard to track down, but each service has a memory request and can have a memory limit. The request is usually around 1-2GiB. Without a limit, memory usage can grow until the host system runs out of it; otherwise the limit is the maximum.

What happens is that each chunk of data, or the whole request, is stored in memory before it can be processed, so if any of these limits are reached they can break the flow.

The limits and requests can be set in the helm chart for each affected part of the system:

reverseProxy:
  resources:
    # here

onedata:
  server:
    resources:
      # here

functions:
  server:
    resources:
      # here

functionRegistry:
  resources:
    # here
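Each of those # here placeholders takes the standard Kubernetes resource syntax; for example (the values are illustrative only, not recommendations):

```yaml
functionRegistry:
  resources:
    requests:
      memory: 2Gi
      cpu: 500m
    limits:
      memory: 4Gi
```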

To investigate your situation, can you please navigate to the Functions panel, open the inspector (right-click -> Inspect) on the Network tab, try uploading the environment again, then save the request into a file and send it to me, @lukas.mueller, or @claudiu.moldovan?

Here is Firefox as an example, but Chrome is very similar. With Safari, you would need to open the preferences and enable “Show Develop menu in menu bar” within the Advanced tab.

This would tell us what kind of error occurred, so we may be able to track it down right away.

Additionally, I am curious about your Dockerfile, as a 3.14GB image is quite big and I am not quite sure how it is possible to reach that just by adding Python packages :slight_smile:

Thank you for reporting this issue!

Hi Ádám,

first of all thank you for your very detailed answer. It is really nice to get such a view behind the curtain :smiley:

I will follow your instructions for the investigation and send you the request file via Slack.
As time has passed since initially posting my question, I have additional information to share. After a couple of tries/days the upload of the environment was successful, although it took 1.3 hrs and at some point the loading indicator once again disappeared.

Although the upload worked, the environment itself is not running, as it throws this error, independent of what request I send:

{"errors":["I/O error on POST request for "http://faas-server:8080/api/v1/function/ff9d30b1-7f28-4a13-a19d-5923f51500dd": Unexpected character ('<' (code 60)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')\n at [Source: (String)"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>Function execution in openfaas failed: 500 Internal Server Error: "Can't reach service for: ff9d30b1-7f28-4a13-a19d-5923f51500dd.<LF>""; line: 1, column: 2]; nested exception is com.fasterxml.jackson.core.JsonParseException: Unexpected character ('<' (code 60)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')\n at [Source: (String)"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>Function execution in openfaas failed: 500 Internal Server Error: "Can't reach service for: ff9d30b1-7f28-4a13-a19d-5923f51500dd.<LF>""; line: 1, column: 2]"]}

I already compared the installed packages in my environment to the environments that are currently running on internal. Unfortunately I couldn’t find any major differences that would explain this error to me. Maybe you have another idea what could cause this error?

As for the Dockerfile: I need to have Torch in my environment, which bumps the local installation of the environment to around 1GB. I honestly don’t know why it becomes so large when creating the Docker image.

Thank you very much!
@adaliszk

Hey,

thank you for the HAR file; I forwarded it to the team to see if we can track something down from it. PyTorch is a heavy beast and by itself can add 1-6G to Docker images, depending on which options you compile into it. I will try to push a bit there to have a base image ready, so that using these libraries is not this painful. Do you know of any other libs that would be nice to have?
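As a side note, if you do not need GPU support, installing the CPU-only PyTorch wheel usually shrinks the image considerably, since the CUDA libraries are where most of the multi-GB size comes from. A hypothetical sketch (the base image and package list are assumptions, not your actual Dockerfile):

```dockerfile
# Hypothetical example - your actual base image and packages may differ
FROM python:3.10-slim

# The CPU-only wheel index avoids bundling the CUDA libraries
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu

# --no-cache-dir keeps pip's download cache out of the image layers
RUN pip install --no-cache-dir numpy scikit-learn flask waitress
```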

On the note of the invocation error: we currently have a bug we are tracking where, after deploying a function, it often takes a long time until the function is available, but we do not wait for that state correctly, so executions often fail for anywhere from seconds to a few minutes until the function itself finally wakes up.

However, there is an extra bug there: if the function has too low a CPU or memory allocation, it will constantly fail to deploy, and we sadly do not show a notification for that in the UI yet.

A possible workaround is to wait a minute or two after deployment; if it still fails to execute, raise the CPU and memory requests, but keep them at a reasonable level.

Sadly, we will not have access to internal to debug the issue there, so to make progress with this we might need to migrate the image and some example code to a different instance to see more.

Best,
Ádám

Hey and thank you again!

The Dockerfile I am running consists of the libraries PyTorch, NumPy, Scikit-learn, Flask, Waitress, and preferably the OD PythonSDK with all dependent libs. This setup would be really nice to have as a base image for doing ML stuff.

I will get back to you for migrating the image and debugging it on another instance!

Best,
Lucas