featurebox.data package

Data tools.

Embedded data: “ele_table.csv”, “ele_megnet.json”, “ie.json”, “oe.csv”

Submodules

featurebox.data.check_data module

class featurebox.data.check_data.CheckElements(check_method: ~typing.Union[~typing.List[str], str] = 'name', func: ~typing.Callable = <function CheckElements.<lambda>>)

Bases: object

Check whether elements are among the available elements.

AVAILABLE_ELE_NUMBER:

(1~84) + (89, 90, 91, 92).

AVAILABLE_ELE_NAME:

(‘H’~’Bi’) + (‘Ac’, ‘Th’, ‘Pa’, ‘U’).

Parameters:
  • check_method (str) – Check by element name or by atomic number. One of “name” or “number”.

  • func (callable) –

    Processing applied to each element before the check. For example, for pymatgen elements:

    >>> func = lambda x: [x.Z, ]
    >>> func2 = lambda x: [x.name, ]
    

Examples

>>> ce = CheckElements.from_list(check_method="name")
>>> ce.check(["Na","Al","Ta"])
['Na', 'Al', 'Ta']
>>> ce = CheckElements.from_list(check_method="name")
>>> ce.check([["Na","Al"],["Na","Ta"]])
[['Na', 'Al'], ['Na', 'Ta']]
>>> ce.check([["Na","Al"],["Na","Ra"],["Zn","H"]])
The 1 (st,ed,th) sample ['Na', 'Ra'] is with element out of AVAILABLE_ELE_NAME
 please to check_data.py for more information.
[['Na', 'Al'], ['Zn', 'H']]
>>> ce.passed_idx()
array([0, 2], dtype=int64)
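The filtering shown above can be sketched in plain Python. This is a minimal illustration only, not the featurebox implementation: the names AVAILABLE_NUMBERS and filter_samples are hypothetical, and the set of numbers is taken from AVAILABLE_ELE_NUMBER.

```python
# Hypothetical stand-in for the checker; not featurebox's own code.
# Atomic numbers 1~84 plus 89~92, as listed under AVAILABLE_ELE_NUMBER.
AVAILABLE_NUMBERS = set(range(1, 85)) | {89, 90, 91, 92}


def filter_samples(samples):
    """Keep the samples whose atomic numbers are all available.

    Returns the kept samples and the indices at which they appeared.
    """
    kept, passed = [], []
    for i, sample in enumerate(samples):
        if all(z in AVAILABLE_NUMBERS for z in sample):
            kept.append(sample)
            passed.append(i)
    return kept, passed


# Na=11, Al=13, Ra=88 (not available), Zn=30, H=1
kept, passed = filter_samples([[11, 13], [11, 88], [30, 1]])
print(kept)    # [[11, 13], [30, 1]]
print(passed)  # [0, 2]
```

The second return value plays the role of passed_idx(): it records which input positions survived, so that other per-sample data can be filtered the same way.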

Examples

>>> ce = CheckElements.from_pymatgen_structures()
...
check(samples: List) → List
Parameters:

samples (list) – Names or numbers, or list of pymatgen.Structure

Returns:

result – List of filtered structures.

Return type:

list

classmethod from_list(check_method='name', grouped='False')

Get a checker for lists of element names or numbers.

classmethod from_pymatgen_structures()

Get checker for list of pymatgen.Structure.

passed_idx() → ndarray

Return the indices of the samples that passed the check, as an np.ndarray.

featurebox.data.data_sep module

class featurebox.data.data_sep.DataSameSep(data: Optional[Dict] = None, sep='-', sites_name='S', dup=3, prefix=None)

Bases: object

Settle data and dispatch entries marked “all” to every site. Make sure the values of the dict are of an immutable type, such as float or int. Otherwise, the stored data will change whenever the input data changes, even after this class/function has been called.
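The warning about mutable values is general Python behavior and can be reproduced without featurebox. The sketch below assumes nothing about how DataSameSep stores its data; it only contrasts a shallow copy, which shares inner objects with the input, with a deep copy, which does not. The list-valued "bonds" key is made up for illustration.

```python
import copy

# Inner values here are lists, i.e. mutable (for illustration only).
source = {"Ta-S1": {"bonds": [3.4, 3.5]}}

stored_by_reference = dict(source)       # shallow copy: inner objects are shared
stored_by_value = copy.deepcopy(source)  # deep copy: inner objects are independent

# Mutate the input after it has been "stored".
source["Ta-S1"]["bonds"].append(9.9)

print(stored_by_reference["Ta-S1"]["bonds"])  # [3.4, 3.5, 9.9]
print(stored_by_value["Ta-S1"]["bonds"])      # [3.4, 3.5]
```

With immutable values such as float or int there is nothing to mutate in place, so the problem does not arise.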

Examples:

>>> d1 = {"Ta-S1":{"bond1":3.4,"bond2":3.5},"Co-S2":{"bond1":3.2,"bond2":3.1},"Fe-Sall":{"bond1":3.2,"bond2":3.1}}
>>> dss = DataSameSep(d1)
>>> dss["Ta-S1"]={"bond1":3.2,"bond2":3.5} # cover the old.
>>> dss.replace({"Ta-S1":{"bond1":3.4,"bond2":3.5},"Co-S2":{"bond1":3.2,"bond2":3.1}}) # cover the old.
>>> dss.replace_entry(label="Ta",site=1,entry={"bond1":3.2,"bond2":3.5}) # cover the old.
>>> dss.update({"Ta-S1":{"bond1":3.4,"bond2":3.5},"Co-S2":{"bond1":3.2,"bond2":3.1}}) # add
>>> dss.update_entry(label="Co",site=0,entry={"bond1":3.2}) # add
>>> dss.update_entry_kv(label="Mg",site="all",key="bond1",value=3.2) # add
>>> dict_data = dss.settle()
>>> pd_data = dss.settle_to_pd(sort=True)
>>> print(pd_data)
       bond1  bond2
Co-S0    3.2    NaN
Co-S2    3.2    3.1
Fe-S0    3.2    3.1
Fe-S1    3.2    3.1
Fe-S2    3.2    3.1
Mg-S0    3.2    NaN
Mg-S1    3.2    NaN
Mg-S2    3.2    NaN
Ta-S1    3.2    3.5

Make sure the keys of data are formatted as {label}-{Si or Sall} and that all values are dicts. The ‘S’ is the same as sites_name.
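As a rough sketch of what dispatching an “all” entry means, the function below expands every key ending in “Sall” to one key per site, 0 to dup-1. The function name dispatch_all is hypothetical, and the rule that site-specific entries override dispatched ones is an assumption, not confirmed behavior of DataSameSep.

```python
def dispatch_all(data, sep="-", sites_name="S", dup=3):
    """Expand '{label}{sep}{sites_name}all' keys to one key per site."""
    all_suffix = f"{sep}{sites_name}all"
    settled = {}
    # First pass: copy each "all" entry to sites 0..dup-1.
    for key, entry in data.items():
        if key.endswith(all_suffix):
            label = key[: -len(all_suffix)]
            for i in range(dup):
                settled[f"{label}{sep}{sites_name}{i}"] = dict(entry)
    # Second pass (assumed precedence): site-specific entries are merged
    # on top of any dispatched values.
    for key, entry in data.items():
        if not key.endswith(all_suffix):
            settled.setdefault(key, {}).update(entry)
    return settled


data = {
    "Co-S2": {"bond1": 3.2, "bond2": 3.1},
    "Fe-Sall": {"bond1": 3.2, "bond2": 3.1},
}
result = dispatch_all(data)
print(sorted(result))  # ['Co-S2', 'Fe-S0', 'Fe-S1', 'Fe-S2']
```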

Parameters:
  • data (dict of dict) – first-level keys are formatted as {label}{sep}{Si or Sall}.

  • sep (str) – default “-”.

  • sites_name (str) – default “S”.

  • dup (int) – default 3.

  • prefix (str) – the class prefix of one batch of data.

replace(data: Dict)

Replace dict data.

Parameters:

data (dict) – {entry_key: entry}.

replace_entry(label: str, site: Union[int, str], entry: Dict, prefix=None)

Replace an entry. This overwrites the old entry.

Parameters:
  • label (str) – label name.

  • site (int or str) – number smaller than self.dup, or “all”.

  • entry (dict) – entry data.

  • prefix (str) – prefix name for batch of data.

settle(sort=False) → Dict

Settle the data and return it as a dict.

Parameters:

sort (bool) – sort the entry keys or not.

Returns:

data_settled – new dict.

Return type:

dict

settle_to_pd(sort=False) → DataFrame

Settle the data and return it as a pd.DataFrame.

Parameters:

sort (bool) – sort the entry keys or not.

Returns:

data_settled – new table.

Return type:

pd.DataFrame

spilt(prefix_label_site='') → Tuple

Try to get the prefix, label, and site number.

update(data: Dict)

Add dict data.

Parameters:

data (dict) – {entry_key: entry}.

update_entry(label: str, site: Union[int, str], entry: Dict, prefix=None)

Add dict data to entry.

Parameters:
  • label (str) – label name.

  • site (int or str) – number smaller than self.dup, or “all”.

  • entry (dict) – entry data.

  • prefix (str) – prefix name for batch of data.

update_entry_kv(label: str, site: Union[int, str], key: str, value: Any, prefix=None)

Add a single key-value pair to an entry.

Parameters:
  • label (str) – label name.

  • site (int or str) – number smaller than self.dup, or “all”.

  • key (str) – name of property.

  • value (any) – value (float, int, str)

  • prefix (str) – prefix name for batch of data.

update_from_pd(df: Union[DataFrame, str])

Read a table and update the data with it. The table must have the form produced by self.settle_to_pd.

If df is a str, try: df = pd.read_csv("df_name", index_col=0).T

Parameters:

df (pd.DataFrame or str) – the table, or the name of a CSV file to read.

featurebox.data.mp_access module

class featurebox.data.mp_access.MpAccess(api_key: str = 'Di28ZMunseR8vr46')

Bases: object

API for the Materials Project database; uses pymatgen to access the data.

Examples

>>> mpa = MpAccess("Di28ZMunseR8vr57") # replace with your own key.
>>> ids = mpa.get_ids({"elements": {"$in": ["Al","O"]},'nelements': {"$lt": 2, "$gte": 1}})
number 29
>>> df = mpa.data_fetcher(mp_ids=ids, mp_props=['material_id', "cif"])
Will fetch 29 inorganic compounds from Materials Project
>>> structures_list = mpa.cifs_to_structures()
...
Parameters:

api_key (str) – Materials Project API key.

cifs_to_structures(cifs: Optional[List[str]] = None) → List[Structure]

Get structures from cifs

data_fetcher(mp_ids: Optional[List[str]] = None, mp_props: Optional[List[str]] = None, elasticity: bool = False) → DataFrame

Fetch data from the Materials Project.

prop_name = ['band_gap', 'density', 'icsd_ids', 'volume', 'material_id', 'pretty_formula', 'elements', 'energy', 'efermi', 'e_above_hull', 'formation_energy_per_atom', 'final_energy_per_atom', 'unit_cell_formula', 'spacegroup', 'nelements', 'nsites', 'final_structure', 'cif', 'piezo', 'diel']

Parameters:
  • mp_ids (list of str) – list of Materials Project ids.

  • mp_props (list of str) – names of the properties to fetch.

  • elasticity (bool) – whether to fetch elasticity data.

Returns:

Table of properties.

Return type:

pandas.DataFrame

get_ids(criteria: Optional[Dict] = None)

Search id by criteria.

support_property = [‘energy’, ‘energy_per_atom’, ‘volume’, ‘formation_energy_per_atom’, ‘nsites’, ‘unit_cell_formula’,’pretty_formula’, ‘is_hubbard’, ‘elements’, ‘nelements’, ‘e_above_hull’, ‘hubbards’, ‘is_compatible’, ‘spacegroup’, ‘task_ids’, ‘band_gap’, ‘density’, ‘icsd_id’, ‘icsd_ids’, ‘cif’, ‘total_magnetization’,’material_id’, ‘oxide_type’, ‘tags’, ‘elasticity’]

Examples

>>> from itertools import combinations
>>> name_list = ["NaCl","CaCo3"]
>>> criteria = {
... 'pretty_formula': {"$in": name_list},
... 'nelements': {"$lt": 3, "$gte": 3},
... 'spacegroup.number': {"$in": [225]},
... 'crystal_system': "cubic",
... 'nsites': {"$lt": 20},
... 'formation_energy_per_atom': {"$lt": 0},
... # "elements": {"$all": "O"},
... # "piezo":{"$ne": None}
... "elements": {"$in": list(combinations(["Al", "Co", "Cr", "Cu", "Fe", 'Ni'], 5))}}

where "$gt" is >, "$gte" is >=, "$lt" is <, "$lte" is <=, and "$ne" is !=. The other operators are "$in", "$nin" (not in), "$or", "$and", "$not", "$nor", and "$all".
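To make the comparison operators concrete, here is a small pure-Python matcher that applies them to in-memory records. It is only a sketch of their meaning: the real query is evaluated by the Materials Project, not by code like this. The matches function, the COMPARE table, and the two records are made up for illustration, and only the scalar operators are covered ("$or", "$and", "$not", "$nor", and "$all" are left out).

```python
import operator

# Comparison operators mapped to Python functions.
COMPARE = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$ne": operator.ne,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(record, criteria):
    """Return True if every field condition in `criteria` holds for `record`."""
    for field, condition in criteria.items():
        if field not in record:
            return False  # simplification: a missing field never matches
        value = record[field]
        if isinstance(condition, dict):
            # All operators given for one field must hold together.
            if not all(COMPARE[op](value, arg) for op, arg in condition.items()):
                return False
        elif value != condition:
            return False
    return True


# Made-up records, for illustration only.
records = [
    {"pretty_formula": "NaCl", "nelements": 2},
    {"pretty_formula": "CaCO3", "nelements": 3},
]
criteria = {
    "pretty_formula": {"$in": ["NaCl", "CaCO3"]},
    "nelements": {"$gte": 2, "$lt": 3},
}
print([r["pretty_formula"] for r in records if matches(r, criteria)])  # ['NaCl']
```

Several operators on one field are combined with a logical AND, so a range such as {"$gte": 2, "$lt": 3} selects only the value 2 for an integer field.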

get_ids_from_web_table(path_file: Optional[str] = None) → List[str]

Read a CSV file downloaded from the web. The file is named ‘_Materials Project.csv’ and contains a “Materials Id” column.
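Reading the id column from such a file can be sketched with the standard csv module. This is not the method's actual code: read_material_ids is a hypothetical helper, the CSV content (including the “Formula” column and the ids) is made up, and only the “Materials Id” column name comes from the description above.

```python
import csv
import io

# Made-up file content; a real export would be opened with open(path, newline="").
csv_text = "Materials Id,Formula\nmp-0001,AB\nmp-0002,CD\n"


def read_material_ids(handle, column="Materials Id"):
    """Collect the values of the id column from an open CSV file handle."""
    reader = csv.DictReader(handle)
    return [row[column] for row in reader]


ids = read_material_ids(io.StringIO(csv_text))
print(ids)  # ['mp-0001', 'mp-0002']
```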

featurebox.data.namesplit module