wiki:CodeBaseCleanup

This page documents some cleanups that I (Sylvain Henry) would like to perform on GHC's code base.

Why?

  • Make the code more beginner friendly
    • Avoid acronyms
    • Hierarchical modules help in understanding the compiler structure
    • Try to correctly name things:
      • e.g. the "type checker" doesn't only check types, hence maybe we should call it "type system" or split it (e.g. Deriver, TypeChecker, etc.)
      • Avoid meaningless codenames (e.g. backpack, hoopl)
  • Make the compiler more modular
    • Allow easier reuse (with the GHC API)
    • Make the compiler easier to debug
    • Make adding new passes/optimisations easier
    • Allow easier and faster testing (testing per component instead of testing the whole pipeline)
    • Allow new, more interactive frontends (step-run each compiler pass and show IR, stats, etc.)
    • Allow profile-guided optimisations (pass count and order, etc.)

Step 1: introduce basic module hierarchy

Implement the proposal for hierarchical module structure in GHC (#13009).

It consists only of renaming and moving modules.

Compared to the original proposal, I have:

  • Put IRs into GHC.IR and compilers into GHC.Compiler
  • Replaced GHC.Types with GHC.Data and GHC.Entity, as the former name is misleading (from a GHC API user's point of view)
  • Split GHC.Typecheck into GHC.IR.Haskell.{TypeChecker,Deriver}
  • Split GHC.Utils into GHC.Utils and GHC.Data (e.g., Bag is in Data, not Utils)
  • etc.

Tree logic:

  • IR: intermediate representations. Each one contains its syntax and the code that manipulates it
    • Haskell
      • Syntax
      • Parser, Lexer, Printer
      • Analyser
      • TypeChecker, Renamer, Deriver
    • Core
      • Syntax
      • Analyser
      • Transformer.{Simplifier,Specialiser,Vectoriser,WorkerWrapper,FloatIn,FloatOut,CommonSubExpr, etc.}
    • Cmm
      • Syntax
      • Analyser
      • Parser, Lexer, Printer
      • Transformer.{CommonBlockElim,ConstantFolder,Dataflow,ShortCutter,Sinker}
    • Stg
      • Syntax
      • Analyser
      • Transformer.{CommonSubExpr,CostCentreCollecter,Unariser}
    • ByteCode.{Assembler,Linker...}
    • Interface.{Loader,Renamer,TypeChecker, Transformer.Tidier}
    • Llvm.{Syntax, Printer}
  • Compiler: converters between representations
    • HaskellToCore
    • CoreToStg
    • StgToCmm
    • CmmToAsm
    • CmmToLlvm
    • CoreToByteCode
    • CoreToInterface
    • CmmToC
    • TemplateToHaskell
  • Entity: entities shared by different phases of the compiler (Class, Id, Name, Unique, etc.)
  • Builtin: builtin stuff
    • Primitive.{Types,Operations}: primitives
    • Names, Types, Uniques: other wired-in stuff
  • Program: GHC-the-program (command-line parser, etc.) and its modes
    • Driver.{Phases,Pipeline}
    • Backpack
    • Make, MakeDepend
  • Interactive: interactive stuff (debugger, closure inspection, interpreter, etc.)
  • Data: data structures (Bag, Tree, etc.)
  • Config: GHC configuration
    • HostPlatform: host platform info
    • Flags: dynamic configuration (DynFlags)
    • Build: generated at build time
  • Packages: package management stuff
  • RTS: interaction with the runtime system (closure and table representation)
  • Utils: utility code or code that doesn't easily belong to another directory (e.g., Outputable, SysTools, Elf, Finder, etc.)
  • Plugin: modules to import to write compiler plugins

Actual renaming: see CodeBaseCleanup/ModuleRenaming

Issues:

  • name clashes: some modules in base (e.g. GHC.Desugar) and ghc-prim (e.g. GHC.Types) use the same GHC prefix
    • maybe we should put all GHC extensions to base under GHC.Exts.* or GHC.Base.*
    • use GHC.Builtin.Primitive.* prefix in ghc-prim?

TODO in the future:

  • Fix comments:
    • Several comments refer to Note "Remote Template Haskell" (supposedly in libraries/ghci/GHCi/TH.hs), but that Note doesn't exist. Maybe it was replaced by Note "Remote GHCi"?
    • Undefined reference to "fill_in in PrelPack.hs" from GHC.Entity.Id
    • Undefined reference to CgConTbls.hs from GHC.Compiler.StgToCmm.Binding
    • Undefined reference to PprMach.hs from GHC.Compiler.CmmToAsm.PIC
    • Undefined reference to Renaming.hs from GHC.IR.Core.Transformer.Substitution
    • Undefined reference to simplStg/SRT.hs from GHC.IR.Cmm.Transformer.InfoTableBuilder
    • Undefined reference to codeGen/CodeGen.hs from GHC.Compiler.HaskellToCore.Foreign.Declaration
    • Undefined reference to RegArchBase.hs from GHC.Compiler.CmmToAsm.Register.Allocator.Graph.ArchX86
    • Undefined reference to MachRegs*.hs and MachRegs.hs from GHC.Compiler.CmmToAsm.Register.Allocator.Graph.ArchBase
  • Binutils 2.17 dates from 2006. Maybe we could remove the hack in GHC.Compiler.CmmToAsm.X86.CodeGen
  • Rename CAF into "static thunk"?
  • Move note files (e.g. profiling-notes, *.tex files) into actual Notes or into the wiki
  • Remove leftover references to RnHsSyn, which no longer exists
  • References to "NCG" should be replaced with references to the "CmmToAsm compiler"
  • Foreign export stubs are generated in GHC.Compiler.HaskellToCore.Foreign.Declaration...
  • Tests still reflect the old hierarchy (e.g., simplCore/should_compile) but renaming them could break other tools

Questions:

  • Why don't we use the mangled selector name ($sel:foo:MkT) in every case (not only when -XDuplicateRecordFields is enabled) instead of the ambiguous one (foo)?
    • Incidentally, partially answered yesterday (2017-06-12) on ticket #13352
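
To make the selector question concrete, here is a minimal sketch of two records sharing a field name under -XDuplicateRecordFields. The types T and U and the helpers tFoo and uFoo are made up for illustration. According to the question above, GHC internally distinguishes the two selectors with mangled names such as $sel:foo:MkT, while the user-facing name foo stays ambiguous.

```haskell
{-# LANGUAGE DuplicateRecordFields #-}

-- Two hypothetical records that share the field name `foo`.
data T = MkT { foo :: Int }
data U = MkU { foo :: Bool }

-- Pattern matching names the constructor, so the field is unambiguous here.
-- Using `foo` as a plain selector function (e.g. `foo t`) would be ambiguous.
tFoo :: T -> Int
tFoo MkT { foo = n } = n

uFoo :: U -> Bool
uFoo MkU { foo = b } = b

main :: IO ()
main = do
  print (tFoo (MkT { foo = 1 }))
  print (uFoo (MkU { foo = True }))
```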

Step 2: split and edit some modules

Some modules contain a lot of (unrelated) stuff. We should split them.

  • GHC.Utils (previously compiler/utils/Util.hs) contains a lot of stuff that should be split
    • Compiler configuration (ghciSupported, etc.): GHC.Config
    • List operations: GHC.Data.List{.Sort,.Fold}
    • Transitive closure: GHC.Data.Graph?
    • Edit distance and fuzzy match: GHC.Utils.FuzzyMatch?
    • Shared globals between GHC package instances: GHC.Utils.SharedGlobals?
    • Command-line parser: GHC.Utils.CmdLine
    • exactLog2 (Integer): GHC.Data.Integer (why isn't it in base?)
    • Read helpers (rational, maybe, etc.): GHC.Utils.Read?
    • doesDirNameExist, getModificationUTCTime: GHC.Utils.FilePath
    • hSetTranslit: GHC.Utils.Handle.Encoding
    • etc.
  • Split GHC.Types (was HscTypes) as it contains a lot of unrelated things
    • ModGuts/ModDetails/ModIface: move to GHC.Data.Module.*
    • Usage/Dependencies: move to GHC.Data.Module.Usage/Dependencies
  • GHC.Data.*: split
    • Split OccEnv from OccName (to harmonize with GHC.Data.Name.Env)?
    • Split ModuleEnv/ModuleSet from Module?
  • Split GHC.Data.Types (was TyCoRep)?
    • Contains many data types (TyThing, Coercion, Type, Kind, etc.)
  • Split PrettyPrint from GHC.Syntax.{Type,Expr,etc.}
  • Split GHC.IR.Core.Transform.{Simplify,SimplUtils,etc.}
  • Split GHC.Rename.ImportExport (e.g., contains "warnMissingSignature")
  • Move the cmmToCmm optimisations from GHC.Compiler.CmmToAsm into GHC.IR.Cmm.Transformer
  • Split type-checker solvers (class lookup, givens, wanted, etc.) (was TcSimplify, TcInteract, etc.)
  • Module name GHC.Compiler.StgToCmm.Layout seems dubious: split and rename?
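
As an illustration of how small and self-contained some of the helpers in the GHC.Utils list are, here is a sketch of what a stand-alone exactLog2 could look like once moved into something like GHC.Data.Integer. This is not GHC's actual implementation (which may restrict the accepted range); it only assumes that the function should return the exponent for exact powers of two.

```haskell
import Data.Bits (shiftR, (.&.))

-- Hypothetical stand-alone version: Just n when x == 2^n, Nothing otherwise.
exactLog2 :: Integer -> Maybe Integer
exactLog2 x
  | x <= 0             = Nothing
  | x .&. (x - 1) /= 0 = Nothing          -- more than one bit set: not a power of two
  | otherwise          = Just (go 0 x)
  where
    -- Count how many times we can shift right before reaching 1.
    go :: Integer -> Integer -> Integer
    go n 1 = n
    go n y = go (n + 1) (y `shiftR` 1)

main :: IO ()
main = mapM_ (print . exactLog2) [1, 8, 12, 0]
```

A helper like this depends only on base, which is what makes it a good candidate for a dedicated module.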

Some function/type names should be modified:

  • Rename codeGen function into stgToCmm
  • Rename nativeCodeGen into cmmToAsm
  • Rename OrdList (in GHC.Data.Tree.OrdList) into TreeSomething? (the current name is misleading)
  • CorePrep (prepare Core for codegen) could use a more explicit name
  • Maybe rename GHC.Data.RepType
  • Maybe rename OccName/RdrName/Name/Id to make them more explicit (may become obsolete with "trees that grow" patch)
    • OccName: NSName (NameSpacedName)
    • RdrName: ParsedName
    • Name: UniqueName
    • Id: TypedName
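
The layering behind the proposed names can be sketched with toy data types. Everything below is a simplified, hypothetical stand-in (GHC's real definitions carry much more information, and types are of course not strings); the point is only to show what each proposed name is meant to convey.

```haskell
-- Simplified, hypothetical stand-ins; not GHC's actual definitions.
data NameSpace = VarNS | DataConNS | TypeNS
  deriving (Eq, Show)

-- OccName -> NSName: a string tagged with its namespace.
data NSName = NSName NameSpace String
  deriving (Eq, Show)

-- RdrName -> ParsedName: what the parser produces, possibly module-qualified.
data ParsedName
  = Unqualified NSName
  | Qualified String NSName
  deriving (Eq, Show)

-- Name -> UniqueName: a resolved name, identified by its unique key.
data UniqueName = UniqueName
  { uniqueKey  :: Int
  , occurrence :: NSName
  } deriving Show

-- Two unique names are equal exactly when their keys are equal.
instance Eq UniqueName where
  a == b = uniqueKey a == uniqueKey b

-- Id -> TypedName: a unique name together with its type (a String in this toy).
data TypedName = TypedName
  { typedName :: UniqueName
  , nameType  :: String
  } deriving Show

main :: IO ()
main = do
  let occ = NSName VarNS "map"
  print (Qualified "Data.List" occ)
  print (TypedName (UniqueName 42 occ) "(a -> b) -> [a] -> [b]")
```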

Step 3: clearly separate GHC-the-program and GHC's API

  • Make the GHC API purer

Abstract file loading (i.e. pluggable Finder)

Currently the Finder assumes that a filesystem exists in which it can find packages and modules.

I would like to add support for module sources that are only available in memory or that can be retrieved from elsewhere (network, etc.).

Something similar to Java's class loaders.
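
A minimal sketch of what a pluggable Finder could look like, assuming a record-of-functions design. All names here (Finder, ModuleSource, inMemoryFinder, orElse) are hypothetical and not part of the GHC API; a real design would also have to deal with packages, interface files and caching.

```haskell
-- Hypothetical types; not part of the GHC API.
type ModuleName = String

data ModuleSource
  = OnDisk FilePath   -- the classic case: a file on the filesystem
  | InMemory String   -- source text held in memory
  deriving (Eq, Show)

-- A finder is just a lookup function in some monad m (IO, a network monad, ...).
newtype Finder m = Finder
  { findModule :: ModuleName -> m (Maybe ModuleSource) }

-- A finder backed by an in-memory association list.
inMemoryFinder :: Applicative m => [(ModuleName, String)] -> Finder m
inMemoryFinder table = Finder $ \name ->
  pure (InMemory <$> lookup name table)

-- Try the first finder, and fall back to the second one if it finds nothing.
orElse :: Monad m => Finder m -> Finder m -> Finder m
orElse first second = Finder $ \name -> do
  result <- findModule first name
  case result of
    Just src -> pure (Just src)
    Nothing  -> findModule second name

main :: IO ()
main = do
  let finder = inMemoryFinder [("Foo", "module Foo where")]
                 `orElse` inMemoryFinder [("Bar", "module Bar where")]
  findModule finder "Bar" >>= print
  findModule finder "Baz" >>= print
```

Chaining finders with orElse is loosely analogous to the delegation between Java class loaders mentioned above.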

Abstract error reporting and logging (i.e. pluggable Logger)

Allow new frontends (using GHC API) to use HTML reporting, etc.

  • Avoid dumping to the filesystem and/or stdout/stderr
  • Use data types instead of raw SDoc reports
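
The two points above could be sketched as follows, assuming a structured Diagnostic type and a Logger that is simply a callback. These names and fields are hypothetical, not GHC's actual API; the sketch only shows that a frontend could collect diagnostics as data and render them however it likes (here, as HTML list items).

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Hypothetical structured diagnostic, replacing a raw pretty-printed report.
data Severity = Warning | Error
  deriving (Eq, Show)

data Diagnostic = Diagnostic
  { diagSeverity :: Severity
  , diagModule   :: String
  , diagLine     :: Int
  , diagMessage  :: String
  } deriving (Eq, Show)

-- A logger is a callback chosen by the frontend.
newtype Logger = Logger { logDiagnostic :: Diagnostic -> IO () }

-- Collect diagnostics in memory (most recent first) instead of printing them.
collectingLogger :: IORef [Diagnostic] -> Logger
collectingLogger ref = Logger $ \d -> modifyIORef' ref (d :)

-- Escape the characters that are special in HTML.
escapeHtml :: String -> String
escapeHtml = concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc '"' = "&quot;"
    esc c   = [c]

-- One possible rendering; a command-line frontend would render differently.
renderHtml :: Diagnostic -> String
renderHtml d =
  "<li class=\"" ++ cssClass (diagSeverity d) ++ "\">"
    ++ escapeHtml (diagModule d) ++ ":" ++ show (diagLine d) ++ ": "
    ++ escapeHtml (diagMessage d) ++ "</li>"
  where
    cssClass Warning = "warning"
    cssClass Error   = "error"

main :: IO ()
main = do
  ref <- newIORef []
  let logger = collectingLogger ref
  logDiagnostic logger (Diagnostic Warning "Foo" 3 "unused binding: x")
  logDiagnostic logger (Diagnostic Error "Bar" 7 "a < b is not a type")
  readIORef ref >>= mapM_ (putStrLn . renderHtml) . reverse
```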

Step 4: clearly separate phases

  • Split DynFlags so that each pass only receives the information it requires
    • e.g. only the required hooks
  • Use data types to report phase statistics, intermediate representations, etc.
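
A toy sketch of both ideas, under stated assumptions: Flags stands in for a monolithic DynFlags-like record, SimplifierConfig is the small projection that one pass actually needs, and PassStats is a structured report returned by the pass. All of these names are hypothetical, and the "program" being simplified is just a list of integers.

```haskell
-- Hypothetical stand-in for a large, global configuration record.
data Flags = Flags
  { optLevel             :: Int
  , simplifierIterations :: Int
  , targetArch           :: String
  , dumpDir              :: Maybe FilePath
  }

-- The only information this toy simplifier pass needs.
newtype SimplifierConfig = SimplifierConfig { maxIterations :: Int }
  deriving (Eq, Show)

-- Project the global flags down to the per-pass configuration.
simplifierConfig :: Flags -> SimplifierConfig
simplifierConfig fl = SimplifierConfig { maxIterations = simplifierIterations fl }

-- Structured statistics returned by the pass instead of a textual dump.
data PassStats = PassStats
  { passName      :: String
  , iterationsRun :: Int
  } deriving (Eq, Show)

-- Toy pass: halve non-zero even numbers until a fixed point or the iteration limit.
runSimplifier :: SimplifierConfig -> [Int] -> ([Int], PassStats)
runSimplifier cfg = go 0
  where
    go n prog
      | n >= maxIterations cfg = done n prog
      | next == prog           = done n prog
      | otherwise              = go (n + 1) next
      where
        next = map step prog
    step x
      | even x && x /= 0 = x `div` 2
      | otherwise        = x
    done n prog = (prog, PassStats "simplifier" n)

main :: IO ()
main = do
  let flags = Flags { optLevel = 1, simplifierIterations = 4
                    , targetArch = "x86_64", dumpDir = Nothing }
  print (runSimplifier (simplifierConfig flags) [8, 3])
```

Because runSimplifier takes only a SimplifierConfig, it can be tested in isolation without constructing the full set of flags.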
Last modified on Jun 14, 2017 11:21:24 PM