Notes
- The Solr query in the migration uses the cursor query method. The `cursorMark` returned by the cursor query is recorded in a migration database and is used for restart purposes.
- The cursor query needs special permissions on the Solr indexes, i.e., the IP address of the machine from which the cursor query is made needs to be included in the whitelists of the indexes.
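As a minimal sketch of how the cursor query and the restart could fit together: the function below pages through a result set and checkpoints each new `cursorMark`. The names `iterate_cursor`, `fetch_page`, `load_mark` and `save_mark` are hypothetical and not taken from the migration code; `fetch_page` is assumed to return the parsed JSON of one Solr response, and the two callbacks stand in for the migration database.

```python
from typing import Callable, Iterator, Optional


def iterate_cursor(
    fetch_page: Callable[[str], dict],
    load_mark: Callable[[], Optional[str]],
    save_mark: Callable[[str], None],
) -> Iterator[dict]:
    """Yield Solr documents page by page, checkpointing the cursorMark.

    fetch_page(mark) is assumed to return one parsed Solr JSON response,
    i.e. a dict with "response" -> "docs" and a top-level "nextCursorMark".
    """
    # Resume from the last recorded cursorMark, or start from "*".
    mark = load_mark() or "*"
    while True:
        page = fetch_page(mark)
        yield from page["response"]["docs"]
        next_mark = page["nextCursorMark"]
        # Solr signals the end of the results by returning the same mark.
        if next_mark == mark:
            break
        # Record the mark only after the page has been handed to the caller.
        save_mark(next_mark)
        mark = next_mark
```

Because the mark is saved after a page has been consumed, a crash in the middle of a page means that page is fetched again on restart, so the ingest step would need to tolerate duplicates.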
- There are both `project=cmip6` and `project=CMIP6`, and both `project=e3sm` and `project=E3SM`. We found that the lowercase `cmip6` and the uppercase `E3SM` were used by the retracted documents, so only `CMIP6` and `e3sm` were migrated to the Globus indexes.
- Generally, the size of 500 documents for the CMIP projects is much less than the maximum size (10 MB) for ingestion. However, for `project=e3sm` and `project=e3sm-supplement`, the size of 500 documents is larger than 10 MB, so we use `rows=100` in the migration of the `e3sm` and `e3sm-supplement` projects, and 1500 for other projects.
- Changes to the Solr document:
  - change the value of `index_node` to `us_index`
  - remove the `url` item if the type of the document is `Dataset`
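The two document changes listed above could be applied with a small helper like the one below. This is only a sketch: the function name `transform_document` is hypothetical, and it assumes that dataset-level records carry `type == "Dataset"` as a plain string.

```python
def transform_document(doc: dict) -> dict:
    """Return a copy of a Solr document with the migration changes applied."""
    out = dict(doc)
    # Every migrated document points at the new index node.
    out["index_node"] = "us_index"
    # Assumption: dataset-level records are marked with type == "Dataset".
    if out.get("type") == "Dataset":
        out.pop("url", None)
    return out
```

Working on a copy keeps the original Solr record available for later verification against the migrated one.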
- Filters during the migration:
  - filter out all metadata at the ORNL index if the value of `data_node` includes neither `anl.gov` nor `llnl.gov` and does not equal `esgf-node.ornl.gov`
  - filter out all metadata at the LLNL index when `source_id=E3SM-2-1`
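One way to read the two filters above is as a single keep/drop predicate. The sketch below is an interpretation, not the migration code: `keep_document` and the index labels `"ORNL"` and `"LLNL"` are hypothetical, and field values are accepted either as a string or as a list of strings because Solr fields can be multi-valued.

```python
def _as_list(value) -> list:
    """Normalize a Solr field value (missing, single or multi-valued) to a list."""
    if value is None:
        return []
    return list(value) if isinstance(value, (list, tuple)) else [value]


def keep_document(doc: dict, index: str) -> bool:
    """Return True if the document should be migrated from the given index."""
    if index == "ORNL":
        nodes = _as_list(doc.get("data_node"))
        # Keep only ANL/LLNL data nodes and the ORNL node itself.
        allowed = any(
            "anl.gov" in n or "llnl.gov" in n or n == "esgf-node.ornl.gov"
            for n in nodes
        )
        if not allowed:
            return False
    if index == "LLNL" and "E3SM-2-1" in _as_list(doc.get("source_id")):
        return False
    return True
```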
- The metadata of `project=e3sm` at the ORNL index shall be discarded, because the metadata was migrated to the ORNL index but the actual files were not. However, half of these metadata have been ingested into the E3SM staged index.
- The formats of the `_timestamp` strings may differ (`YYYY-MM-DDThh:mm:ss.sssZ` vs. `YYYY-MM-DDThh:mm:ssZ`), as there are two APIs that generate them during publication:
  - a legacy API for harvesting THREDDS catalogs and generating publication records
  - the "push" publisher API (esg-publisher)
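Code that compares or sorts `_timestamp` values has to accept both formats. A minimal sketch, assuming the values are always in UTC as the trailing `Z` suggests (the helper name `parse_timestamp` is hypothetical):

```python
from datetime import datetime, timezone

# The two formats noted above: with and without fractional seconds.
_TIMESTAMP_FORMATS = ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ")


def parse_timestamp(value: str) -> datetime:
    """Parse a _timestamp string in either format into an aware UTC datetime."""
    for fmt in _TIMESTAMP_FORMATS:
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognized _timestamp format: {value!r}")
```

Comparing the parsed values rather than the raw strings avoids ordering surprises when one record has fractional seconds and another does not.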
- The Solr query used in the migration only searches the default local shard on port 8983. There is another local shard on port 8995 that is only used for GFDL CMIP6 data. MX: This is not true anymore. There were CMIP5 data in the local shard on port 8995 too, so we migrated the CMIP5 and CMIP6 metadata on the shard. On Apr. 12, 2025, we found that obs4MIPs and other projects are in the shard too!
- The other shards shown in the LLNL esg_search are the "replica" shards at LLNL, which are synced copies of the remote shards. However, GFDL's remote shard is dead, so the GFDL local replica shard (port 8995) can be considered a local shard for migration purposes.
- The esg search using the LLNL index is a distributed query by default; however, the distributed shards do not include the ANL and ORNL Solr indexes. The esg search using the ANL or ORNL index only searches its local shards, even when `distrib=True`.
- The Solr query strings used in the migration:
  - query the default shard (for ANL, LLNL and ORNL, the default shard on port 8983)
    `select?sort=id+asc&rows=500&cursorMark=*&wt=json&q=project:CMIP6&fq=_timestamp:[* TO 2025-03-16T00:00:00Z]`
  - query the additional local shard (only for the LLNL index; the shard is used to save the GFDL metadata)
    `select?sort=id+asc&rows=500&cursorMark=*&wt=json&q=project:CMIP6&fq=_timestamp:[* TO 2025-03-16T00:00:00Z]&shards=localhost:8995/solr/datasets`
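The query strings above contain spaces, brackets and colons, so building them with a URL encoder is less error-prone than concatenating by hand. The sketch below is one possible helper (the name `build_select_query` and its parameters are hypothetical); it percent-encodes the reserved characters, so the result looks different from the strings above but decodes to the same parameters.

```python
from typing import Optional
from urllib.parse import urlencode


def build_select_query(
    project: str,
    before: str,
    rows: int = 500,
    cursor_mark: str = "*",
    shards: Optional[str] = None,
) -> str:
    """Build the relative Solr select URL for one cursor page."""
    params = [
        ("sort", "id asc"),  # cursor queries require a sort on the unique key
        ("rows", str(rows)),
        ("cursorMark", cursor_mark),
        ("wt", "json"),
        ("q", f"project:{project}"),
        ("fq", f"_timestamp:[* TO {before}]"),
    ]
    if shards:
        # Only needed for the additional local shard at LLNL.
        params.append(("shards", shards))
    return "select?" + urlencode(params)
```

For example, `build_select_query("CMIP6", "2025-03-16T00:00:00Z", shards="localhost:8995/solr/datasets")` would correspond to the second query above.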
- The esg search query strings used in the verification:
  - query the default shard (for ANL, LLNL and ORNL, the default shard on port 8983)
    `search?query=project:CMIP6&limit=500&from=*&to=2025-03-16T00:00:00Z&format=application%2Fsolr%2Bjson&distrib=false`
  - query the additional local shard (only for the LLNL index; the shard is used to save the GFDL metadata)
    `search?query=project:CMIP6&limit=500&from=*&to=2025-03-16T00:00:00Z&format=application%2Fsolr%2Bjson&distrib=true&shards=localhost:8995/solr`
- The following metadata are considered errors or test datasets/files and shall be deleted from the public index:
  - metadata whose data node is `lapdat02-p.gfdl.noaa.gov` (1 dataset and 16 files)
  - metadata whose data node is `esgf-node.cels.anl.gov` (1 dataset and 3 files)
- The master and slave Solr indexes are not the same. The slave Solr index is connected to `esgf-node.ornl.gov`. They should be synced with each other, but they were not. We found that:
  - For the input4MIPs project, the count of metadata in the master Solr index is less than that in the slave index (3 fewer datasets and 3 fewer files).
- After our first migration on March 31, 2025, there were several retractions on the data nodes `crd-esgf-drc.ec.gc.ca` and `esgf-data1.llnl.gov`. How do we handle the retractions?
- Synchronizer design:
  - It is run every 5 minutes by crontab in a container at ORNL.
  - The query time window is 5 minutes, from 20 to 15 minutes before the start time of the synchronizer.
  - Only one instance is allowed to run. If the previous sync has not finished after 5 minutes, the new sync does not start; it is retried every 5 minutes until the previous sync ends.
  - The UTC time in the synchronizer is obtained from the internet NTP server (`pool.ntp.org`), with two web APIs (`http://worldtimeapi.org/api/timezone/Etc/UTC` and `https://timeapi.io/api/Time/current/zone?timeZone=UTC`) and the system time as backups.
  - During the test period (now to the end of April), the metadata are synced to the backup index (`a37bc34d-de15-493b-9221-b95b13114fd8`) instead of the public index.
  - If the previous sync fails, the next sync first restarts the previous sync by finding the `marker` (scroll query) or `offset` (post query) in the migration database.
  - The Globus staged indexes are queried with the scroll search, because the post search cannot be used when the number of records is larger than 10000. We cannot assume that the number of records in the query of a 5-minute sync is less than 10000.
  - Batched ingest is used to make sure that the total size of each batch is less than 10 MB (the maximum size allowed by Globus ingest).
  - The synchronizer will be deployed to the container at ORNL and be converted from test to production runs on May 1, 2025?
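Two pieces of the design above are easy to get wrong: the lagged query window and the size-limited batching. The sketch below illustrates both under stated assumptions; it is not the synchronizer code. The names `sync_window` and `batch_by_size` are hypothetical, "10 MB" is taken as 10 MiB, and the size of a batch is approximated by the sum of the JSON-serialized documents, ignoring any envelope the ingest request adds.

```python
import json
from datetime import datetime, timedelta
from typing import Iterable, Iterator, List, Tuple

# Assumption: the 10 MB ingest limit is interpreted as 10 MiB.
MAX_BATCH_BYTES = 10 * 1024 * 1024


def sync_window(start: datetime) -> Tuple[datetime, datetime]:
    """Return the 5-minute query window, 20 to 15 minutes before start."""
    return start - timedelta(minutes=20), start - timedelta(minutes=15)


def batch_by_size(
    docs: Iterable[dict], max_bytes: int = MAX_BATCH_BYTES
) -> Iterator[List[dict]]:
    """Group documents into batches whose serialized size stays within max_bytes.

    A single document larger than max_bytes is still emitted, alone in its
    own batch, so the caller has to handle that case separately.
    """
    batch: List[dict] = []
    size = 0
    for doc in docs:
        doc_size = len(json.dumps(doc).encode("utf-8"))
        # Close the current batch before it would exceed the limit.
        if batch and size + doc_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_size
    if batch:
        yield batch
```

In practice a limit somewhat below 10 MB would leave headroom for the request envelope, which this size estimate does not count.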