luke-tierney at uiowa.edu
2024-Jun-06 14:47 UTC
[Rd] clarifying and adjusting the C API for R
This is an update on some current work on the C API for use in R
extensions.

The internal R implementation makes use of tens of thousands of C
entry points. On Linux and Windows, which support visibility
restrictions, most of these are visible only within the R executable or
shared library. About 1500 are not hidden and are visible to
dynamically loaded shared libraries, such as ones in packages, and to
embedding applications.

There are two main reasons for limiting access to entry points in a
software framework:

- Some entry points are very easy to use in ways that corrupt internal
  data, leading to segfaults or, worse, incorrect computations without
  segfaults.

- Some entry points expose internal structure and other implementation
  details, which makes it hard to make improvements without breaking
  client code that has come to depend on these details.

The API of C entry points that can be used in R extensions, both for
packages and embedding, has evolved organically over many years. The
definition for the current release expressed in the Writing R
Extensions manual (WRE) is roughly:

    An entry point can be used if (1) it is declared in a header file
    in R.home("include"), and (2) if it is documented for use in WRE.

Ideally, (1) would be necessary and sufficient, but for a variety of
reasons that isn't achievable, at least not in the near term. (2) can
be challenging to determine; in particular, it is not amenable to a
computational answer.

An experimental effort is underway to add annotations to the WRE
Texinfo source to allow (2) to be answered unambiguously. The
annotations so far mostly reflect my reading of WRE and may be revised
as they are reviewed by others. The annotated document can be used for
programmatically identifying what is currently considered part of the C
API.
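As an aside on the visibility restrictions mentioned above, here is a minimal sketch of how a shared library can keep one entry point internal while exporting another. It assumes GCC or Clang on an ELF platform such as Linux. The macro and function names are hypothetical and are not taken from the R sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macros for illustration; R's sources use their own. */
#if defined(__GNUC__)
# define ATTR_HIDDEN   __attribute__((visibility("hidden")))
# define ATTR_EXPORTED __attribute__((visibility("default")))
#else
# define ATTR_HIDDEN
# define ATTR_EXPORTED
#endif

/* Internal helper: callable from anywhere inside the library, but
   hidden from code that loads the library dynamically. It does no
   checking, so it is easy to misuse. */
ATTR_HIDDEN void internal_set_length(int *len, int n)
{
    *len = n;
}

/* Exported entry point: validates its arguments before touching
   the internal state. Returns 0 on success, -1 on a bad request. */
ATTR_EXPORTED int checked_set_length(int *len, int n, int capacity)
{
    assert(len != NULL);
    if (n < 0 || n > capacity)
        return -1;
    internal_set_length(len, n);
    return 0;
}
```

Built as a shared library (for example with cc -shared -fPIC), only checked_set_length should appear among the dynamic symbols. On platforms without this style of visibility control the attributes may have no effect.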
The result so far is an experimental function tools:::funAPI():

    > head(tools:::funAPI())
                     name                    loc apitype
    1 Rf_AdobeSymbol2utf8 R_ext/GraphicsDevice.h    eapi
    2        alloc3DArray                    WRE     api
    3          allocArray                    WRE     api
    4           allocLang                    WRE     api
    5           allocList                    WRE     api
    6         allocMatrix                    WRE     api

The 'apitype' field has three possible levels:

    | api  | stable (ideally) API |
    | eapi | experimental API     |
    | emb  | embedding API        |

Entry points in the embedded API would typically only be used in
applications embedding R or providing new front ends, but might be
reasonable to use in packages that support embedding.

The 'loc' field indicates how the entry point is identified as part of
an API: explicit mention in WRE, or declaration in a header file
identified as fully part of an API.

[tools:::funAPI() may not be completely accurate as it relies on
regular expressions for examining header files considered part of the
API rather than proper parsing. But it seems to be pretty close to
what can be achieved with proper parsing. Proper parsing would add
dependencies on additional tools, which I would like to avoid for
now. One dependency already present is that a C compiler has to be on
the search path and cc -E has to run the C pre-processor.]

Two additional experimental functions are available for analyzing
package compliance: tools:::checkPkgAPI and tools:::checkAllPkgsAPI.
These examine installed packages.

[These may produce some false positives on macOS; they may or may not
work on Windows at this point.]

Using these tools initially showed around 200 non-API entry points
used across packages on CRAN and BIOC. Ideally this number should be
reduced to zero. This will require a combination of additions to the
API and changes in packages.

Some entry points can safely be added to the API. Around 40 have
already been added to WRE with API annotations; another 40 or so can
probably be added after review.
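To illustrate the kind of lookup a compliance check performs, here is a minimal sketch in C. It is not how tools:::checkPkgAPI is implemented. The table is a hand-copied excerpt of the head() output shown earlier in this message, and the type and function names (api_entry, classify_entry_point) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the table: an entry point name and its 'apitype'. */
typedef struct {
    const char *name;
    const char *apitype;   /* "api", "eapi" or "emb" */
} api_entry;

/* Excerpt only, copied from the head(tools:::funAPI()) output. */
static const api_entry api_table[] = {
    { "Rf_AdobeSymbol2utf8", "eapi" },
    { "alloc3DArray",        "api"  },
    { "allocArray",          "api"  },
    { "allocLang",           "api"  },
    { "allocList",           "api"  },
    { "allocMatrix",         "api"  },
};

/* Return the apitype recorded for 'name', or NULL if the name is
   not in the table (i.e. not known to be part of an API). */
const char *classify_entry_point(const char *name)
{
    assert(name != NULL);
    size_t n = sizeof api_table / sizeof api_table[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(api_table[i].name, name) == 0)
            return api_table[i].apitype;
    return NULL;
}
```

A checker along these lines would collect the symbols a package's shared library references and flag every name for which the lookup returns NULL. With a six-row excerpt that includes many genuine API entry points, so a real check needs the full table.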
The remainder mostly fall into two groups:

- Entry points that should never be used in packages, such as
  SET_OBJECT or SETLENGTH (or any non-API SETXYZ functions for that
  matter) that can create inconsistent or corrupt internal state.

- Entry points that depend on the existence of internal structure that
  might be subject to change, such as the existence of promise objects
  or internal structure of environments.

Many, if not most, of these seem to be used in idioms that can either
be accomplished with existing higher-level functions already in the
API, or by new higher-level functions that can be created and
added. Working through these will take some time and coordination
between R-core and maintainers of affected packages.

Once things have gelled a bit more I hope to turn this into a blog
post that will include some examples of moving non-API entry point
uses into compliance.

Best,

luke

-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney at uiowa.edu
Iowa City, IA 52242                 WWW:  stat.uiowa.edu
Thanks for working on this Luke! We appreciate your efforts to make it
easier to tell what's in the exported API and we're very happy to work
with you on any changes needed to tidyverse/r-lib packages.

Hadley

-- 
hadley.nz
Thanks for sharing this overview of an interesting and much-needed
project. You mention that R exports about 1500 symbols (on platforms
supporting visibility) but this subject isn't mentioned explicitly
again in your note, so I'm wondering how things tie together.

Un-exported symbols cannot be part of the API - how would people use
them in this case? In a perfect world the set of exported symbols could
define the API or match it exactly, but I guess that isn't the case at
present. So I conclude that R exports extra (i.e. non-API) symbols. Is
part of the goal to remove these extra exports?

-Steve
Thanks so much for your wonderful work, Luke! I didn't expect such a
clarification to happen this soon. This is really great.

For convenience, I created a quick web page to search the result of
tools:::funAPI().

yutannihilation.github.io/R-fun-API

Hope this helps those who are too lazy to install R-devel to check.

Best,
Yutani
luke-tierney at uiowa.edu
2024-Jun-18 14:08 UTC
[Rd] clarifying and adjusting the C API for R
Another quick update:

Over 100 entry points used in packages for which it was safe to do so
have now been marked as part of an API (in some cases after adding
error checking of arguments). These can be used in package C code,
with caveats for ones considered experimental or intended for embedded
use.

The remaining 100 or so non-API entry points used in packages will
require changes in package C code. In some cases the API already
provides safe alternatives to unsafe internal entry points. In most
other cases it should be possible to develop safer interfaces that
allow packages to accomplish what they need to do in a more robust
way, while giving R maintainers and developers the freedom to make
needed internal changes without disrupting package space. It will take
some time to develop these new interfaces.

'Writing R Extensions' now has a new section 'Moving into C API
compliance' that should help with adapting to these changes.

Best,

luke

-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney at uiowa.edu
Iowa City, IA 52242                 WWW:  stat.uiowa.edu