Hervé Pagès
2010-Oct-22 06:02 UTC
[Rd] cannot connect to an FTP server with long HELLO message
Hi,

Trying to access files on the FTP server at ftp.ncbi.nih.gov
will either time out or sometimes even segfault on Linux.
The following two methods give the same result:

  f <- url("ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/GDS/GDS10.soft.gz",
           open="r")

  download.file("ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/GDS/GDS10.soft.gz",
                destfile=tempfile())

I've tried one or the other method with all release versions >= 2.8
and with current R devel, and they always fail to connect to this
FTP server.

What's particular about this FTP server is that it sends a long HELLO
message before it finally sends the 220 control code.
Using the Unix ftp client:

-----------------------------------------------------------------------------
hpages@latitude:~$ ftp ftp.ncbi.nih.gov
Connected to ftp.ncbi.nih.gov.
220-
 Warning Notice!

 This is a U.S. Government computer system, which may be accessed and used
 only for authorized Government business by authorized personnel.
 Unauthorized access or use of this computer system may subject violators to
 criminal, civil, and/or administrative action.

 All information on this computer system may be intercepted, recorded, read,
 copied, and disclosed by and to authorized personnel for official purposes,
 including criminal investigations. Such information includes sensitive data
 encrypted to comply with confidentiality and privacy requirements. Access
 or use of this computer system by any person, whether authorized or
 unauthorized, constitutes consent to these terms. There is no right of
 privacy in this system.
 ---
 Welcome to the NCBI ftp server! The official anonymous access URL is
 ftp://ftp.ncbi.nih.gov

 Public data may be downloaded by logging in as "anonymous" using your E-mail
 address as a password.

 Please see ftp://ftp.ncbi.nih.gov/README.ftp for hints on large file
 transfers
220 FTP Server ready.
-----------------------------------------------------------------------------

This seems to cause problems for the nanoftp module
(src/modules/internet/nanoftp.c) used by url() and download.file(),
as it doesn't seem to be able to catch the 220 control code.

I'm not familiar with the nanoftp module, or with socket programming in
general, or with RFC 959 (FTP protocol), so I'm not really in a position
to say exactly what's going wrong in the module, but it seems that
increasing the value of FTP_BUF_SIZE (the size of the buffer for data
received from the control connection) fixes the problem.
Currently this is:

  #define FTP_BUF_SIZE 1024

but, interestingly, *any* value > 1024 seems to fix the problem (even
though the long HELLO message above is 1091 bytes).

Any idea what's going on?

Thanks,
H.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
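For readers unfamiliar with the reply format in the transcript above: in RFC 959 a multi-line reply opens with the code followed by a hyphen ("220-") and ends at the first line that starts with the same code followed by a space ("220 "). The following is a minimal sketch of that framing rule only. It is not taken from nanoftp.c, the function names `reply_code_with_sep` and `ftp_final_reply_code` are hypothetical, and it assumes the whole reply is already sitting in one buffer.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the 3-digit code if the line starts with
 * three digits followed by `sep` (' ' or '-'), otherwise 0. */
static int reply_code_with_sep(const char *line, size_t len, char sep)
{
    if (len < 4)
        return 0;
    if (!isdigit((unsigned char)line[0]) ||
        !isdigit((unsigned char)line[1]) ||
        !isdigit((unsigned char)line[2]))
        return 0;
    if (line[3] != sep)
        return 0;
    return (line[0] - '0') * 100 + (line[1] - '0') * 10 + (line[2] - '0');
}

/* Hypothetical scanner: walk the complete lines in buf and return the
 * code of the first finished reply, or 0 if the reply is still open
 * (no terminating "NNN " line seen, or the last line lacks its '\n'). */
int ftp_final_reply_code(const char *buf, size_t len)
{
    int pending = 0;   /* code of an open multi-line reply, 0 if none */
    size_t pos = 0;

    while (pos < len) {
        const char *line = buf + pos;
        const char *nl = memchr(line, '\n', len - pos);
        if (nl == NULL)
            break;     /* incomplete last line: more data is needed */
        size_t linelen = (size_t)(nl - line);
        int code = reply_code_with_sep(line, linelen, ' ');

        if (pending == 0) {
            if (code != 0)
                return code;               /* single-line reply */
            pending = reply_code_with_sep(line, linelen, '-');
        } else if (code == pending) {
            return code;                   /* closing line of the reply */
        }
        pos += linelen + 1;                /* skip past the '\n' */
    }
    return 0;
}
```

Called on the full NCBI banner this would return 220. Called on a buffer that stops before the closing "220 FTP Server ready." line, it returns 0, meaning the caller has to keep reading.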
Prof Brian Ripley
2010-Oct-26 09:01 UTC
[Rd] cannot connect to an FTP server with long HELLO message
The example works for me (eventually: the site was very slow to
respond): nanoftp reads the response in 1024-byte chunks and makes
sense of it.

We do provide debugging facilities via, say,

  options(internet.info=0, warn=1, warning.length=4000)

which may help you debug this. Simply fiddling with the buffer size
doesn't help understanding and might well break something else. The
code is essentially unchanged since 2006, when inter alia the buffer
size was doubled to 1024 (because libxml2 2.6.6 did), and AFAICS is
essentially unchanged in the current snapshots of libxml2.

I can surmise that the nanoftp C code might break if the actual
control code spanned 1024-byte chunks, but it needs someone with the
problem to debug in more detail.

On Thu, 21 Oct 2010, Hervé Pagès wrote:

> [...]
> What's particular about this FTP server is that it sends a long HELLO
> message before it finally sends the 220 control code.
> Using the Unix ftp client:

Well, one of many such.

> [...]

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
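To illustrate the surmise above about a control code spanning two 1024-byte chunks: one way to make a reader insensitive to where the chunk boundary falls is to carry the start of the current line over from one read to the next. The sketch below shows that general technique under stated assumptions. It is not the nanoftp code and is not a verified diagnosis of the failure reported in this thread. The names `reply_scanner`, `reply_scanner_init` and `reply_scanner_feed` are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical state kept between reads of the control connection. */
typedef struct {
    char   head[4];   /* first four bytes of the line being assembled */
    size_t used;      /* how many bytes of head are filled (0..4) */
    int    pending;   /* code of an open multi-line reply, 0 if none */
} reply_scanner;

void reply_scanner_init(reply_scanner *s)
{
    s->used = 0;
    s->pending = 0;
}

/* Return the 3-digit code if head holds "NNN" followed by sep, else 0. */
static int head_code(const reply_scanner *s, char sep)
{
    if (s->used < 4 || s->head[3] != sep)
        return 0;
    for (int i = 0; i < 3; i++)
        if (!isdigit((unsigned char)s->head[i]))
            return 0;
    return (s->head[0] - '0') * 100 + (s->head[1] - '0') * 10
         + (s->head[2] - '0');
}

/* Feed one chunk as received from the socket. Returns the reply code
 * once the closing "NNN " line has been seen, otherwise 0. Any bytes
 * after the closing line in the same chunk are ignored in this sketch. */
int reply_scanner_feed(reply_scanner *s, const char *chunk, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (chunk[i] != '\n') {
            /* Only the first four bytes of a line decide its role. */
            if (s->used < sizeof s->head)
                s->head[s->used++] = chunk[i];
            continue;
        }
        int closing = head_code(s, ' ');
        int opening = head_code(s, '-');
        s->used = 0;                      /* start a fresh line */

        if (s->pending == 0) {
            if (closing != 0)
                return closing;           /* single-line reply */
            s->pending = opening;
        } else if (closing == s->pending) {
            s->pending = 0;
            return closing;               /* end of multi-line reply */
        }
    }
    return 0;
}
```

Because the scanner remembers the partial line, feeding it a chunk ending in "22" followed by a chunk starting with "0 FTP Server ready." yields the same result as feeding both in one piece, so the outcome does not depend on the buffer size.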