I'm using Tree-sitter to parse Python code and extract ASTs, and trying to manually traverse ASTs to infer types based on assignments and function definitions.
But I'm struggling with accurately resolving types (variables, functions, classes) due to Python's dynamic typing. Specifically, challenges arise with:
- Inferring types in dynamically typed contexts.
- Handling types from external modules/packages.
- Leveraging Python's type annotations for improved type resolution.
I just need to resolve type that I have defined in my repository.
For example:
car.py
class Car
# ...
Now in a different class
:
from car import Car
car = Car()
# ...
bmw = car
# ...
I need to know BMW is a car.
How can I successfully navigate type resolution in Python with Tree-sitter?
What approaches or algorithms can I use to accurately resolve types defined in external modules without executing the Python code?
I also need to tackle def-use (definition-use) chains to track variable assignments and their types across the codebase.
- Resolve identifiers to their definitions.
- Determine the type of each identifier (class, module, etc.).
- Extract the name of the class or module the identifier refers to.
Also I think it needs to handle control flow as well.
Do we need to implement some form of scope-graph to achieve this?
Also it needs to implement python LEGB logic somehow.
Given the dynamic nature of Python, direct type inference can be complex. You can, however, leverage Python's type annotations and incorporate a mix of static analysis techniques: that would enhance the accuracy of type resolution.
Python's type annotations provide valuable information for type inference. When traversing the AST, specifically look for nodes related to function definitions (
FunctionDef
) and variable assignments (Assign
) that contain annotations.For the setup from
tree-sitter/py-tree-sitter
, cloning https://github.com/tree-sitter/tree-sitter-python in a vendor subfolder, and try (as in this jdoodle):For dynamically typed variables or those imported from external modules, consider implementing a heuristic-based approach. You could maintain a map of known types for standard library modules and popular third-party libraries.
When encountering an import statement, check if the module and its types are in your map and apply these types when these modules' members are used. That means, for any variable assignment, function call, or other expressions involving these imported members, you infer their types based on the information previously mapped.
That approach not only aids in resolving types for standard Python types but also for types that come from external libraries, assuming you have them included in your 'known_types' map."
For function and method calls, you could analyze the return types based on the annotations in their definitions. That would require building a context or a scope map as you traverse the AST, linking functions and methods to their return types when defined. Then, when these functions or methods are called, you can infer the type based on this context:
Def-use (definition-use) analysis is indeed an important aspect of type resolution, especially in dynamically typed languages like Python. It involves tracking where variables (and more generally, symbols) are defined and where they are used throughout the code. That process is essential for understanding variable scopes, lifetimes, and types, which, in turn, helps in resolving types more accurately and detecting potential errors.
For incorporating def-use analysis into your approach with Tree-sitter for Python code, you would enhance your existing strategy by adding
To implement def-use analysis for type resolution with Tree-sitter, you can extend your existing traversal functions to maintain a context or environment that maps variable names to their definitions and types. The general idea would be:
To resolve names in Python, considering both scope and control flow, you would need more than Tree-Sitter.You would need a tree climber: tree-climber. That would help tracking where variables are declared and used within different scopes but also understanding the flow of the program to predict how variables might change over time.
Scope tracking involves keeping a record of variable definitions within their respective scopes. Each function, class, and block (e.g., loops, conditionals) introduces a new scope.
A control flow analysis would require analyzing the program's paths (e.g., loops, conditionals) to understand how variables might be re-assigned or modified. Control flow graphs (CFGs) are a common tool used for this purpose, where nodes represent statements and edges represent the flow of execution.
You would need to extend the Tree-sitter AST traversal to build a CFG by marking nodes that introduce control flow changes (e.g.,
if
statements, loops) and tracking the flow between them.bstee615/tree-climber
is an example of such an extension, but only supports C.The general idea would be:
That is just an outline for a conceptual overview. A full implementation would involve detailed handling of Python's syntax and semantics, potentially including type inference algorithms and more comprehensive static analysis tools.
Once you have constructed the CFG using
CFGNode
objects, traversing and resolving names means navigating this graph and performing analysis based on the control flow it represents: you start from entry points (like the start of a function), and follow the edges that represent control flow between nodes. During this traversal, you maintain and update a scope context to resolve names based on where you are in the flow of the program.As you traverse the CFG, you manage scope similarly to how you would in an AST traversal, but with additional considerations for the control flow. For example, entering a new block (e.g., a loop or conditional block) might create a new scope, and returning from a function call would revert to the previous scope.
You might have to do path-sensitive analysis by analyzing different paths through the CFG separately, for accurate name and type resolution in the presence of conditional logic.