How can you distinguish between a standard library call, a third-party library call, and an API call from the repository?


I am working on a project where I have to determine whether a particular call or import is:

  1. From the standard library of the language (Python) I'm using. (I am already considering using sys.stdlib_module_names for this.)
  2. From a third-party library, or
  3. An API call made to some service from within the repository.

Is there an efficient way or tool that could help me quickly differentiate between these types of calls or imports? I'm primarily using Python, but methods for other languages are welcome as well.

More specifically, I aim to compile a dataset of the function calls (including library calls) made within a given repository from GitHub.

First, I download a given Python repository from GitHub.

Then my main objectives are:

  • To extract all function calls made within the target repository.
  • To gather details of these function calls, including the arguments they use.
  • For this purpose, I am employing the Python AST (Abstract Syntax Tree) parser to detect and catalogue function calls and their respective arguments.
  • My entire analysis pipeline is based within a Python script leveraging the AST module.
  • Now I have to determine which of these function calls originate from within the repository itself.

For example, if there is a call

file_b.py

def abc():
  ....

file_a.py

import math
import numpy as np
from file_b import abc
....
def foo():
   ..
   x = np.linspace(-math.pi, math.pi, 2000)
   y = np.sin(x)
   ...
   ..
   c = abc()

I want to only capture abc (as it is defined in that repository) and not the calls from numpy module.

There are 2 best solutions below

4
ntg On

You can use inspect, since this module seems to have been written with purposes like yours in mind. A simple way to differentiate is by the on-disk location of the library that a function comes from, which inspect can give you, e.g.:

import os
import inspect

# third-party library
import numpy as np
# some "local" library
import cfg
# we can assign it to a variable if needed
foo = np
print(1, os.path.dirname(os.path.abspath(foo.__file__)))
foo = cfg
print(2, os.path.dirname(os.path.abspath(foo.__file__)))
print()
# we can get the module from any function
unknown_function = np.sort
the_module = inspect.getmodule(unknown_function)
print(the_module)
print(3, os.path.dirname(os.path.abspath(the_module.__file__)))

result is:

1 /home/datalab/workspace/conda/lib/python3.8/site-packages/numpy
2 /home/datalab/workspace/utils

<module 'numpy' from '/home/datalab/workspace/conda/lib/python3.8/site-packages/numpy/__init__.py'>
3 /home/datalab/workspace/conda/lib/python3.8/site-packages/numpy

In your case, you seem to have 3 categories.

  • The first (standard library) lives under the Python installation itself, e.g. the lib/python3.x directory of your conda/pip environment (you can check the location of your environment using sys.executable).
  • The second (third-party libraries) should resolve to a well-known path prefix, typically a site-packages directory, as in the output above.
  • The third lives within the project repository, whose root may already be known or can be found by e.g. running subprocess.check_output(['git', 'rev-parse', '--show-toplevel']) from within the repository.
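
The three path checks above can be sketched roughly as follows. This is only a minimal sketch: classify_path and its parameters are hypothetical names, not part of inspect, and by default it assumes that the directories reported by sysconfig for the running interpreter are the ones you care about.

```python
import sysconfig
from pathlib import Path


def classify_path(module_file, repo_root, site_dirs=None, stdlib_dir=None):
    """Classify a module file as 'third-party', 'local', 'stdlib', or 'unknown'.

    Hypothetical helper: site_dirs and stdlib_dir default to the running
    interpreter's directories, which is an assumption about your setup.
    """
    paths = sysconfig.get_paths()
    if site_dirs is None:
        site_dirs = [paths["purelib"], paths["platlib"]]
    if stdlib_dir is None:
        stdlib_dir = paths["stdlib"]

    target = Path(module_file).resolve()

    def is_under(directory):
        # True if target lies somewhere below the given directory.
        return Path(directory).resolve() in target.parents

    # site-packages usually sits inside the stdlib directory, and a virtual
    # environment may sit inside the repository, so test it first.
    if any(is_under(d) for d in site_dirs):
        return "third-party"
    if is_under(repo_root):
        return "local"
    if is_under(stdlib_dir):
        return "stdlib"
    return "unknown"
```

It could be fed the path obtained earlier, e.g. classify_path(inspect.getmodule(unknown_function).__file__, repo_root). Built-in modules such as sys have no __file__, so they would need a separate name check, for example against sys.stdlib_module_names (Python 3.10+).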

Inspect can do a lot more than give you the location on the disk in more complex situations. Here is an example along with some code. In PythonModuleOfTheWeek there are more uses and here you can find some further examples.

A practical note: importing a module means running foreign code, so make sure you trust the code, or run it in a sandboxed environment. How to do the latter is a question of its own.
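
One way to reduce that risk, at least for top-level names, is to ask the import system where a module would be loaded from without executing it. The sketch below uses importlib.util.find_spec for that; module_origin is a hypothetical helper name, and for repository modules it assumes the repository root is already on sys.path.

```python
import importlib.util


def module_origin(top_level_name):
    """Return where a top-level module would load from, without executing it."""
    if "." in top_level_name:
        # find_spec imports parent packages for dotted names, which would
        # run their code, so this sketch only accepts top-level names.
        raise ValueError("pass a top-level module name")
    try:
        spec = importlib.util.find_spec(top_level_name)
    except (ImportError, ValueError):
        return None
    if spec is None:
        return None  # the import system cannot find the module
    # A file path for ordinary modules, 'built-in' or 'frozen' for modules
    # without a file, and None for namespace packages.
    return spec.origin
```

The returned path can then be compared against the repository root or the site-packages prefix in the same way as the __file__ attribute above.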

A theoretical note: in extreme cases I think this problem is undecidable. A formal proof might construct two modules, one whose import halts and one whose import does not; any analysis that could tell the two apart would solve Turing's halting problem. For our case using inspect, this means that there exist modules whose import can potentially take forever. In practice this should not be a problem, because any reasonable module should import in a reasonable time.

0
Head Wizard Locke On

Pylint (https://www.pylint.org/) is a static analysis tool that should cover what you need, and it comes with numerous editor and IDE integrations.

Pylint output can be pushed to a text file, and you can customize the format of the output and then parse it with a customized script. Said script could isolate and flag lines of output that have to do with your 3 categories, or other categories and specifics you wish to call out of the output log.

The configuration options include standard checkers and extensions, which you can also write:

  • You can tell it to ignore specific modules (--ignored-modules)
  • You can add paths to the list of source roots (--source-roots) used to determine package namespace for modules located under the source roots
  • You can write a graph of internal dependencies to a given file (--int-import-graph)
  • You can force import order to recognize a module as part of a third party library (--known-third-party)
  • And much more! See for yourself.

As mentioned above, you can write your own extensions: transform modules are a type of Pylint plugin that can be tailored toward a specific module, library, or framework. Additionally, custom checkers can analyse a module as a raw file stream, as a stream of tokens, or as an AST. See: https://pylint.pycqa.org/en/latest/development_guide/how_tos/custom_checkers.html#write-a-checker and the related question "pylint plugin to warn of specific function use?"

Note that, when writing your scripts, you may make use of the inspect module or the dir() function to inspect modules and help identify where they have come from. See: https://www.javatpoint.com/list-all-functions-from-a-python-module

For example:

import module
dir(module)

Or:

from inspect import getmembers, isfunction
import stats
# getmembers yields (name, value) pairs; keep those whose value is a function
print([f for f in getmembers(stats) if isfunction(f[1])])

You can also use regex and string parsing to examine output logs of pylint and handle them accordingly. Though I mentioned this previously, I wanted to emphasize this.

ASTs (abstract syntax trees) help Python applications process trees of the Python abstract syntax grammar. A Python AST can be traversed, and each node can be traced back to its source. See these other answers for additional information pertaining to module source determination from an AST:

You can also learn more about using AST in this Medium article: https://medium.com/@wshanshan/intro-to-python-ast-module-bbd22cd505f7

and also in this Pybit.es article: https://pybit.es/articles/ast-intro/
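
To connect this back to the question, here is a rough sketch of how an AST pass might keep only the calls whose names were imported from the repository. All function names here are hypothetical. It assumes that the repository root is the import root (so file_b maps to file_b.py directly under it) and that relative imports always point inside the repository. It ignores functions defined and called in the same file, star imports, and names rebound at runtime.

```python
import ast
from pathlib import Path


def imported_names(tree):
    """Map each name bound by an import statement to its source module."""
    bound = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname:
                    bound[alias.asname] = alias.name  # import a.b as c
                else:
                    top = alias.name.split(".")[0]    # import a.b binds "a"
                    bound[top] = top
        elif isinstance(node, ast.ImportFrom):
            # Relative imports are marked with leading dots (level > 0).
            module = "." * node.level + (node.module or "")
            for alias in node.names:
                bound[alias.asname or alias.name] = module
    return bound


def root_name(func):
    """Return the leftmost identifier of a call target such as np.linalg.norm."""
    while isinstance(func, ast.Attribute):
        func = func.value
    return func.id if isinstance(func, ast.Name) else None


def is_repo_module(module, repo_root):
    """True if the module name maps to a .py file or package under repo_root."""
    if module.startswith("."):
        return True  # assumption: relative imports stay inside the repository
    base = Path(repo_root).joinpath(*module.split("."))
    return base.with_suffix(".py").is_file() or (base / "__init__.py").is_file()


def repo_calls(source, repo_root):
    """List call targets whose root name was imported from the repository."""
    tree = ast.parse(source)  # parsing only; nothing is imported or executed
    bound = imported_names(tree)
    calls = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = root_name(node.func)
            if name in bound and is_repo_module(bound[name], repo_root):
                calls.append(ast.unparse(node.func))  # requires Python 3.9+
    return calls
```

Applied to the file_a.py example from the question, with file_b.py present in the repository root, this should keep abc and skip the np.* calls, since numpy does not resolve to a file inside the repository.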