Copy Files#
The Graviti Data Platform supports copying binary files across different datasets.
Graviti stores binary files in Object Storage Services rather than in the Graviti database; the database only stores the access info of the binary files. The copy operation therefore copies only this access info, not the binary files themselves, so copying consumes no additional Object Storage space.
Important
Binary files can only be copied across datasets in the same workspace with the same storage config.
Note
The copy can be understood as being analogous to the Linux ln operation.
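To picture the reference-only behavior described above, here is a minimal sketch (not the Graviti implementation; all names are hypothetical) in which a "dataset" holds only keys pointing into a shared blob store, so copying a dataset duplicates keys rather than bytes:

```python
# Hypothetical blob store standing in for Object Storage Services:
# blob key -> stored bytes
object_storage = {"9cf96ce": b"contents of 0000.txt"}

# A "dataset" records only access info: filename -> blob key
source_dataset = {"0000.txt": "9cf96ce"}

# "Copying" the dataset duplicates the keys, never the stored bytes
target_dataset = dict(source_dataset)

# Both datasets resolve to the same stored object; storage is unchanged
assert object_storage[target_dataset["0000.txt"]] == b"contents of 0000.txt"
assert len(object_storage) == 1
```

This mirrors why the copy behaves like `ln`: two entries point at one underlying object.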
Example Code#
This example downsamples a source dataset containing binary files and copies the downsampled DataFrame to a new dataset.
Get the DataFrame from the source dataset:
from graviti import Workspace
# initialize the Workspace
ws = Workspace(f"{ACCESS_KEY}")
# get the source dataset
src_ds = ws.datasets.get("source_dataset")
# get the "data" sheet from the source dataset
src_df = src_ds["data"]
The “data” sheet in the source dataset contains binary files:
>>> src_df.schema
record(
fields={
'filename': string(),
'file': file.File(),
},
)
>>> src_df
filename file
0 0000.txt RemoteFile("9cf96ce")
1 0001.txt RemoteFile("d31c5f0")
2 0002.txt RemoteFile("5f83d98")
3 0003.txt RemoteFile("272c9a9")
4 0004.txt RemoteFile("d25c42d")
5 0005.txt RemoteFile("b6e904a")
6 0006.txt RemoteFile("019fad7")
7 0007.txt RemoteFile("7100110")
8 0008.txt RemoteFile("945b3a8")
9 0009.txt RemoteFile("59a0f9a")
Downsample the source DataFrame and add it to the target dataset:
>>> # create the target dataset
>>> dst_ds = ws.datasets.create("target_dataset")
>>> # use the dataframe slice feature to downsample the source dataframe
>>> dst_df = src_df.iloc[::2]
>>> dst_df
filename file
0 0000.txt RemoteFile("9cf96ce")
1 0002.txt RemoteFile("5f83d98")
2 0004.txt RemoteFile("d25c42d")
3 0006.txt RemoteFile("019fad7")
4 0008.txt RemoteFile("945b3a8")
>>> # add the downsampled dataframe to the target dataset
>>> dst_ds["downsampled"] = dst_df
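The `[::2]` slice used above keeps every second row, counting from row 0. The same step-slicing semantics apply to plain Python sequences, which may make the downsampling easier to picture:

```python
# Build filenames matching the example sheet: 0000.txt ... 0009.txt
filenames = [f"{i:04d}.txt" for i in range(10)]

# [::2] means: start at the beginning, go to the end, step by 2,
# so only the even-indexed elements are kept
downsampled = filenames[::2]
print(downsampled)
# ['0000.txt', '0002.txt', '0004.txt', '0006.txt', '0008.txt']
```

This is why the target DataFrame above contains exactly the five even-numbered files.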
Commit:
>>> dst_ds.commit("copy files from source_dataset")
Draft("#1: copy files from source_dataset") created successfully
uploading structured data: 100%|██████████████████████████| 5/5 [00:03<00:00, 1.38it/s]
uploading binary files: 100%|██████████████████████████| 5/5 [00:03<00:00, 1.38it/s]
Draft("#1: copy files from source_dataset") uploaded successfully
Draft("#1: copy files from source_dataset") committed successfully
The HEAD of the dataset after the commit:
Branch("main")(
(commit_id): '913b44d7aebe43a18265c27a20d2decf',
(parent): None,
(title): 'copy files from source_dataset',
(committer): 'linjiX',
(committed_at): 2022-11-11 18:52:19+08:00
)
After the commit, the downsampled DataFrame with its binary files has been copied to the target dataset:
>>> # read the data from the target dataset
>>> dst_ds["downsampled"]
filename file
0 0000.txt RemoteFile("9cf96ce")
1 0002.txt RemoteFile("5f83d98")
2 0004.txt RemoteFile("d25c42d")
3 0006.txt RemoteFile("019fad7")
4 0008.txt RemoteFile("945b3a8")