Potential incomplete logic in extract_file_name Method for Handling File Names #29

Closed
AissaGeek opened this issue Oct 29, 2024 · 1 comment


AissaGeek commented Oct 29, 2024

Description

The extract_file_name method of the DownloadWorker class in downloader/__init__.py is responsible for extracting the file name and extension from a download URL. However, it is unclear whether it handles file names that follow a specific pattern correctly.

Example

Here is an example payload that illustrates the issue:

{
   "topic":"cache/a/wis2/us-noaa-synoptic/data/core/weather/surface-based-observations/synop",
   "payload":{
      "id":"585436b8-95fb-11ef-845c-e43d1a213544",
      "type":"Feature",
      "version":"v04",
      "geometry":{
         "type":"Point",
         "coordinates":[
            -101.04662,
            39.42746
         ]
      },
      "properties":{
         "data_id":"us-noaa-synoptic/data/core/weather/surface-based-observations/synop/WIGOS_0-840-0-KCBK_20241029T133500",
         "datetime":"2024-10-29T13:35:00Z",
         "pubtime":"2024-10-29T13:40:21Z",
         "integrity":{
            "method":"sha512",
            "value":"Rt3ZDweAXB6Kl6xYpLLf/DXZJU0X1SkWw+wdlh6Shb2orW96IO9I/kq09dKMc9zqsgQRS91iQC9rWTvdIIPFiQ=="
         },
         "content":{
            "encoding":"base64",
            "value":"QlVGUgAA9wQAABYAAAAAAAAAAAZuHgAH6AodDSMAAAALAAABgMGWx1AAAMoAADSAAAS0NCSwAAAAAAAAAAAAAAAP//paGhJYAAAAAAAAAAAAAAAAAAAAAP0U62Nivs0PDyVDWVGtTFuTnf///////////7Y/tWDJ////////////////////gAB//////////////////////////////////////8Aln////////////////////////A+j/v/PABr///////////////////////////////////////////////////////////////+ANzc3Nw==",
            "size":247
         },
         "wigos_station_identifier":"0-840-0-KCBK"
      },
      "links":[
         {
            "rel":"canonical",
            "type":"application/x-bufr",
            "href":"https://wis2.dwd.de/gc/24h/us-noaa-synoptic/3c0adaec-9e84-4986-a17a-429293eec998__WIGOS_0-840-0-KCBK_20241029T133500.bufr4",
            "length":247
         },
         {
            "rel":"via",
            "type":"text/html",
            "href":"https://oscar.wmo.int/surface/#/search/station/stationReportDetails/0-840-0-KCBK"
         }
      ]
   },
   "target":"surface-obs"
}

The method extracts the filename from "href":"https://wis2.dwd.de/gc/24h/us-noaa-synoptic/3c0adaec-9e84-4986-a17a-429293eec998__WIGOS_0-840-0-KCBK_20241029T133500.bufr4". The extraction yields the filename 3c0adaec-9e84-4986-a17a-429293eec998__WIGOS_0-840-0-KCBK_20241029T133500, which I doubt is the expected filename.
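To make the behaviour concrete, here is a minimal sketch of URL-based extraction. It is not the actual extract_file_name code; split_url_filename is a hypothetical helper that only assumes the name is taken from the last path segment of the href.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def split_url_filename(url: str) -> tuple[str, str]:
    """Hypothetical helper: return (stem, extension) of a URL's last path segment."""
    # Keep only the path component, dropping any query string or fragment.
    path = unquote(urlsplit(url).path)
    name = PurePosixPath(path)
    # .suffix is the final ".ext" part (or "" if there is none).
    return name.stem, name.suffix.lstrip(".")


href = (
    "https://wis2.dwd.de/gc/24h/us-noaa-synoptic/"
    "3c0adaec-9e84-4986-a17a-429293eec998__WIGOS_0-840-0-KCBK_20241029T133500.bufr4"
)
stem, ext = split_url_filename(href)
print(stem, ext)
```

Under that assumption the UUID-like prefix added by the cache stays part of the stem, which matches the value reported above.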

Suggested solution

In almost all cases, the filename can be extracted from "data_id":"us-noaa-synoptic/data/core/weather/surface-based-observations/synop/WIGOS_0-840-0-KCBK_20241029T133500".
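A rough sketch of what this suggestion could look like, assuming the filename is simply the last "/"-separated segment of data_id. The helper name filename_from_data_id is made up for illustration and is not part of the project.

```python
def filename_from_data_id(data_id: str) -> str:
    """Hypothetical helper: use the last '/'-separated segment of data_id."""
    # rsplit with maxsplit=1 also works when data_id contains no '/' at all.
    return data_id.rsplit("/", 1)[-1]


data_id = (
    "us-noaa-synoptic/data/core/weather/surface-based-observations/synop/"
    "WIGOS_0-840-0-KCBK_20241029T133500"
)
candidate = filename_from_data_id(data_id)
print(candidate)
```

Note that data_id carries no file extension in this payload, so the extension would still have to come from elsewhere, for example the link's href or media type.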

Member

david-i-berry commented Dec 6, 2024

There is no standardisation of the data ID, and in some cases the data ID can contain characters such as a comma. Using the filename extracted from the URL, as above, was found to be more reliable. In the example given above, the extracted filename is as expected / intended.
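One way to illustrate the reliability concern is a small check against the POSIX portable filename character set (letters, digits, ".", "_" and "-"). This is only a sketch: is_portable_filename is a hypothetical helper, and the comma-containing segment below is an invented example, not a real data ID.

```python
import re

# POSIX portable filename character set: letters, digits, '.', '_' and '-'.
_PORTABLE = re.compile(r"[A-Za-z0-9._-]+")


def is_portable_filename(name: str) -> bool:
    """Hypothetical check: True if name is non-empty and uses only portable characters."""
    return _PORTABLE.fullmatch(name) is not None


# Name taken from the canonical link in the payload above.
url_name = "3c0adaec-9e84-4986-a17a-429293eec998__WIGOS_0-840-0-KCBK_20241029T133500.bufr4"
# Invented data ID segment containing a comma, for illustration only.
made_up_segment = "station,20241029T133500"

print(is_portable_filename(url_name))
print(is_portable_filename(made_up_segment))
```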

Closing due to related issue #19, further discussion can continue there.
