Title: | Easily Download Heaps of PDF and CSV Files |
---|---|
Description: | Makes it easy to download a large number of files such as PDF files and CSV files, while automatically slowing down requests, letting you know where it is up to, and adjusting for files that have already been downloaded. |
Authors: | Rohan Alexander [aut, cre, cph], A Mahfouz [aut] |
Maintainer: | Rohan Alexander <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2024-11-22 03:51:26 UTC |
Source: | https://github.com/rohanalexander/heapsofpapers |
The check_for_existence function looks at the folder that you are going to
save your PDFs to and checks whether you already have any of them. It then
suggests that you filter the data to remove them.
check_for_existence(data, save_names = "save_names", dir = "heaps_of")
data |
A dataframe that contains URLs that you want to download and the names that you want to save them as. |
save_names |
The name of the column that contains the file names under which the
downloaded files will be saved. |
dir |
The directory to download files to. The usage above sets the default to "heaps_of". |
The data dataframe, with a column specifying whether each file has been
downloaded.
    ## Not run:
    two_pdfs <- tibble::tibble(
      locations_are = c(
        "https://osf.io/preprints/socarxiv/z4qg9/download",
        "https://osf.io/preprints/socarxiv/a29h8/download"
      ),
      save_here = c(
        "competing_effects_on_the_average_age_of_infant_death.pdf",
        "cesr_an_r_package_for_the_canadian_election_study.pdf"
      )
    )

    heapsofpapers::get_and_save(
      data = two_pdfs,
      links = "locations_are",
      save_names = "save_here"
    )

    heapsofpapers::check_for_existence(data = two_pdfs, save_names = "save_here")
    ## End(Not run)
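The kind of check that check_for_existence describes can be illustrated in base R without the package or a network connection. The following is a minimal sketch, not the package's own code: the helper `flag_existing` and the column name `already_have` are hypothetical names chosen for illustration, and the column that check_for_existence actually adds may be named differently.

```r
# Hypothetical helper, not part of heapsofpapers: flag rows whose file
# already exists in `dir`.
flag_existing <- function(data, save_names = "save_names", dir = "heaps_of") {
  # Build the full path for each row and test whether a file is there.
  paths <- file.path(dir, data[[save_names]])
  data$already_have <- file.exists(paths)  # `already_have` is an assumed name
  data
}

# Self-contained demo in a temporary directory holding one of two files.
demo_dir <- file.path(tempdir(), "heaps_of_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "first.pdf"))

papers <- data.frame(
  save_here = c("first.pdf", "second.pdf"),
  stringsAsFactors = FALSE
)

flagged <- flag_existing(papers, save_names = "save_here", dir = demo_dir)

# Filter to the rows that still need to be downloaded.
still_needed <- flagged[!flagged$already_have, , drop = FALSE]
```

Filtering on the flag column before downloading is one way to follow the suggestion to remove files you already have.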
The get_and_save function works with a tibble of locations (usually URLs) and
file names. It downloads the file at each location, saves it under the
corresponding file name as it goes, and lets you know where it is up to. It
politely waits around 5 seconds between calls, and skips locations that give
an error.
    get_and_save(
      data,
      links = "links",
      save_names = "save_names",
      dir = "heaps_of",
      bucket = NULL,
      delay = 5,
      print_every = 1,
      dupe_strategy = "overwrite"
    )
data |
A dataframe that contains URLs that you want to download and the names that you want to save them as. |
links |
The name of the column that contains the URLs that you want to download. |
save_names |
The name of the column that contains the file names under which the
downloaded files will be saved. |
dir |
The directory to download files to. The usage above sets the default to "heaps_of". |
bucket |
The name of the AWS S3 bucket to save files to. |
delay |
The number of seconds to wait between downloads; the default (and minimum) is five seconds. A bit of noise is automatically added to the delay to lessen the effect on any systematic processes that might otherwise be running. |
print_every |
The default is that you get a print message for every file, but you can change this. If you want to print an update for every second file then set this equal to 2, for a printed update every tenth file, set it to 10, etc. |
dupe_strategy |
There are a few ways of dealing with the situation where you have already downloaded some of the files. By default the function will get them again and overwrite the existing copies. However, you can also specify 'ignore', in which case those files will be skipped. You can also investigate duplicates yourself using heapsofpapers::check_for_existence(). |
A print statement in the console about whether each of the links was saved
(if not turned off by the user), and a notification that the function has
finished.
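The behaviour described for get_and_save (a delay with added noise between calls, a progress message every `print_every` items, and skipping locations that give an error) can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `polite_loop`, its `fetch` and `noise` arguments, and the stand-in `fake_fetch` are hypothetical, and the uniform noise is an assumed choice since the documentation does not say how the noise is generated.

```r
# Hypothetical sketch, not the heapsofpapers implementation.
polite_loop <- function(items, fetch, delay = 5, noise = 1, print_every = 1) {
  saved <- logical(length(items))
  for (i in seq_along(items)) {
    # Treat any error from `fetch` as a failed item and carry on.
    saved[i] <- tryCatch({
      fetch(items[[i]])
      TRUE
    }, error = function(e) FALSE)

    # Report progress only on every `print_every`-th item.
    if (i %% print_every == 0) {
      message(
        "Item ", i, " of ", length(items), ": ",
        if (saved[i]) "saved" else "skipped after an error"
      )
    }

    # Wait the base delay plus uniform noise (an assumed noise model),
    # except after the last item.
    if (i < length(items)) {
      Sys.sleep(delay + stats::runif(1, min = 0, max = noise))
    }
  }
  message("Finished.")
  saved
}

# Demo with a stand-in `fetch` that fails for one input. The delay is
# shortened here only so that the demo runs quickly.
fake_fetch <- function(x) {
  if (x == "bad") stop("could not download") else invisible(x)
}

result <- polite_loop(
  c("a", "bad", "c", "d"),
  fake_fetch,
  delay = 0, noise = 0.01, print_every = 2
)
```

With `print_every = 2` the sketch reports on the second and fourth items only, which mirrors the documented meaning of that argument.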
    ## Not run:
    two_pdfs <- tibble::tibble(
      locations_are = c(
        "https://osf.io/preprints/socarxiv/z4qg9/download",
        "https://osf.io/preprints/socarxiv/a29h8/download"
      ),
      save_here = c(
        "competing_effects_on_the_average_age_of_infant_death.pdf",
        "cesr_an_r_package_for_the_canadian_election_study.pdf"
      )
    )

    heapsofpapers::get_and_save(
      data = two_pdfs,
      links = "locations_are",
      save_names = "save_here"
    )
    ## End(Not run)
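The difference between the two documented values of dupe_strategy can be made concrete with a small base R sketch. The helper `rows_to_download` is hypothetical and only models which rows would be fetched under each strategy; it is not how the package itself is written.

```r
# Hypothetical helper: choose which rows to download under each strategy.
rows_to_download <- function(data, save_names, dir,
                             dupe_strategy = "overwrite") {
  dupe_strategy <- match.arg(dupe_strategy, c("overwrite", "ignore"))

  # "overwrite": fetch everything again, replacing existing files.
  if (dupe_strategy == "overwrite") {
    return(data)
  }

  # "ignore": drop rows whose file is already present in `dir`.
  exists_already <- file.exists(file.path(dir, data[[save_names]]))
  data[!exists_already, , drop = FALSE]
}

# Demo in a temporary directory that already holds one of the two files.
dupe_dir <- file.path(tempdir(), "dupe_demo")
dir.create(dupe_dir, showWarnings = FALSE)
file.create(file.path(dupe_dir, "have.pdf"))

to_get <- data.frame(
  save_here = c("have.pdf", "need.pdf"),
  stringsAsFactors = FALSE
)

n_overwrite <- nrow(rows_to_download(to_get, "save_here", dupe_dir, "overwrite"))
n_ignore <- nrow(rows_to_download(to_get, "save_here", dupe_dir, "ignore"))
```

In the real function the equivalent choice is made by passing `dupe_strategy = "ignore"` to heapsofpapers::get_and_save().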